Abstract: Algorithms in varied fields use the idea of maintaining a distribution over a 
certain set and use the multiplicative update rule to iteratively change these weights. Their 
analyses are usually very similar and rely on an exponential potential function. 

In this survey we present a simple meta-algorithm that unifies many of these disparate 
algorithms and derives them as simple instantiations of the meta-algorithm. We feel that 
since this meta-algorithm and its analysis are so simple, and its applications so broad, it 
should be a standard part of algorithms courses, like “divide and conquer.” 
1 Introduction 


The Multiplicative Weights (MW) method is a simple idea which has been repeatedly discovered in fields 
as diverse as Machine Learning, Optimization, and Game Theory. The setting for this algorithm is the 
following. A decision maker has a choice of n decisions, and needs to repeatedly make a decision and 
obtain an associated payoff. The decision maker’s goal, in the long run, is to achieve a total payoff which 
is comparable to the payoff of that fixed decision that maximizes the total payoff with the benefit of 
hindsight. While this best decision may not be known a priori, it is still possible to achieve this goal by 
maintaining weights on the decisions, and choosing the decisions randomly with probability proportional 
to the weights. In each successive round, the weights are updated by multiplying them with factors which 
depend on the payoff of the associated decision in that round. Intuitively, this scheme works because it 
tends to focus higher weight on higher payoff decisions in the long run. 


*This project was supported by David and Lucile Packard Fellowship and NSF grants MSPA-MCS 0528414 and 
CCR-0205594. 


© 2012 Sanjeev Arora, Elad Hazan and Satyen Kale 
© Licensed under a Creative Commons Attribution License DOI: 10.4086/toc.2012.v008a006 


SANJEEV ARORA, ELAD HAZAN AND SATYEN KALE 


This idea lies at the core of a variety of algorithms. Some examples include: the AdaBoost algorithm 
in machine learning [26]; algorithms for game playing studied in economics (see references later); the 
Plotkin-Shmoys-Tardos algorithm for packing and covering LPs [56], and its improvements in the case 
of flow problems by Young [65], Garg-Könemann [29, 30], Fleischer [24] and others; methods for 
convex optimization like exponentiated gradient (mirror descent), Lagrangian multipliers, and subgradient 
methods; Impagliazzo’s proof of the Yao XOR lemma [40], etc. The analysis of the running time uses a 
potential function argument and the final running time is proportional to $1/\varepsilon^2$. 


It has been clear to several researchers that these results are very similar. For example Khandekar’s 
Ph. D. thesis [46] makes this point about the varied applications of this idea to convex optimization. 
The purpose of this survey is to clarify that many of these applications are instances of the same, more 
general algorithm (although several specialized applications, such as [53], require additional technical 
work). This meta-algorithm is very similar to the “Hedge” algorithm from learning theory [26]. Similar 
algorithms have been independently rediscovered in many other fields; see below. The advantage of 
deriving the above algorithms from the same meta-algorithm is that this highlights their commonalities as 
well as their differences. To give an example, the algorithms of Garg-Könemann [29, 30] were felt to be 
quite different from those of Plotkin-Shmoys-Tardos [56]. In our framework, they can be seen as a clever 
trick for “width reduction” for the Plotkin-Shmoys-Tardos algorithms (see Section 3.4). 


We feel that this meta-algorithm and its analysis are simple and useful enough that they should 
be viewed as a basic tool taught to all algorithms students together with divide-and-conquer, dynamic 
programming, random sampling, and the like. Note that the multiplicative weights update rule may be 
seen as a “constructive” version of LP duality—equivalently, von Neumann’s minimax theorem in game 
theory—and it gives a fairly concrete method for competing players to arrive at a solution/equilibrium 
(see Section 3.2). This may be an appealing feature in introductory algorithms courses, since the standard 
algorithms for LP such as simplex, ellipsoid, or interior point lack such a game-theoretic interpretation. 
Furthermore, it is a convenient stepping stone to many other topics that rarely get mentioned in algorithms 
courses, including online algorithms (see the basic scenario in Section 1.1) and machine learning. 
Moreover, our proofs seem easier and cleaner than the entropy-based proofs for the same results in 
machine learning (although the proof technique we use here has been used before; see, for example, Blum’s 
survey [10]). 


The current paper is chiefly a survey. It introduces the main algorithm, gives a few variants (mostly 
having to do with the range in which the payoffs lie), and surveys the most important applications—often 
with complete proofs. Note however that this survey does not cover all applications of the technique, as 
several of these require considerable additional technical work which is beyond the scope of this paper. 
We have provided pointers to some such applications which use the multiplicative weights technique at 
their core without going into more details. There are also a few small results that appear to be new, such 
as the variant of the Garg-Könemann algorithm in Section 3.4 and the lower bound in Section 4. 
Related work. An algorithm similar in flavor to the Multiplicative Weights algorithm was proposed in 
game theory in the early 1950’s [13, 12, 59]. Following Brown [12], this algorithm was called “Fictitious 
Play”: at each step each player observes actions taken by his opponent in previous stages, updates his 
beliefs about his opponents’ strategies, and chooses the pure best response against these beliefs. In the 
simplest case, the player simply assumes that the opponent is playing from a stationary distribution and 
sets his current belief of the opponent’s distribution to be the empirical frequency of the strategies played 
by the opponent. This simple idea (which was shown to lead to optimal solutions in the limit in various 
cases) led to numerous developments in economics, including Arrow-Debreu General Equilibrium theory 
and more recently, evolutionary game theory. Grigoriadis and Khachiyan [33] showed how a randomized 
variant of “Fictitious Play” can solve two player zero-sum games efficiently. This algorithm is precisely 
the multiplicative weights algorithm. It can be viewed as a soft version of fictitious play, when the player 
gives higher weight to the strategies which pay off better, and chooses her strategy using these weights 
rather than choosing the best response strategy. 


In Machine Learning, the earliest form of the multiplicative weights update rule was used by Littlestone 
in his well-known Winnow algorithm [50, 51]. It is somewhat reminiscent of the older perceptron 
learning algorithm of Minsky and Papert [55]. The Winnow algorithm was generalized by Littlestone and 
Warmuth [52] in the form of the Weighted Majority algorithm, and later by Freund and Schapire in the 
form of the Hedge algorithm [26]. We note that most relevant papers in learning theory use an analysis 
that relies on entropy (or its cousin, Kullback-Leibler divergence) calculations. This analysis is closely 
related to our analysis, but we use exponential functions instead of the logarithm, or entropy, used in 
those papers. The underlying calculation is the same: whereas we repeatedly use the fact that $e^x \approx 1 + x$ 
when $|x|$ is small, they use the fact that $\ln(1 + x) \approx x$. We feel that our approach is cleaner (although 
the entropy based approach yields somewhat tighter bounds that are useful in some applications, see 
Section 2.2). 


Other applications of the multiplicative weights algorithm in computational geometry include Clarkson’s 
algorithm for linear programming with a bounded number of variables in linear time [20, 21]. 
Following Clarkson, Brönnimann and Goodrich use similar methods to find Set Covers for hypergraphs 
with small VC dimension [11]. 


The weighted majority algorithm as well as more sophisticated versions have been independently 
discovered in operations research and statistical decision making in the context of the On-line decision 
problem; see the surveys of Cover [22], Foster and Vohra [25], and also Blum [10] who includes 
applications of weighted majority to machine learning. A notable algorithm, which is different from but 
related to our framework, was developed by Hannan in the 1950’s [34]. Kalai and Vempala showed how 
to derive efficient algorithms via methods similar to Hannan’s [43]. We show how Hannan’s algorithm 
with the appropriate choice of parameters yields the multiplicative update decision rule in Section 3.8. 


Within computer science, several researchers have previously noted the close relationships between 
multiplicative update algorithms used in different contexts. Young [65] notes the connection between 
fast LP algorithms and Raghavan’s method of pessimistic estimators for derandomization of randomized 
rounding algorithms; see our Section 3.5. Klivans and Servedio [49] relate boosting algorithms in learning 
theory to proofs of Yao’s XOR Lemma; see our Section 3.6. Garg and Khandekar [28] describe a common 
framework for convex optimization problems that contains Garg-Könemann and Plotkin-Shmoys-Tardos 
as subcases. 
To the best of our knowledge our framework is the most general and, arguably, the simplest. We 
readily acknowledge the influence of all previous papers (especially Young [65] and Freund-Schapire [27]) 
on the development of our framework. We emphasize again that we do not claim that every algorithm 
designed using the multiplicative update idea fits in our framework, just that most do. 


Paper organization. We proceed to define the illustrative weighted majority algorithm in this section. 
In Section 2 we describe the general MW meta-algorithm, followed by numerous and varied applications 
in Section 3. In Section 4 we give lower bounds, followed by the more general matrix MW algorithm in 
Section 5. 


1.1 The weighted majority algorithm 


Now we briefly illustrate the weighted majority algorithm in a simple and concrete setting, which will 
naturally lead to our generalized meta-algorithm. This is known as the Prediction from Expert Advice 
problem. 

Imagine the process of picking good times to invest in a stock. For simplicity, assume that there is a 
single stock of interest, and its daily price movement is modeled as a sequence of binary events: up/down. 
(Below, this will be generalized to allow non-binary events.) Each morning we try to predict whether the 
price will go up or down that day; if our prediction happens to be wrong we lose a dollar that day, and if 
it’s correct, we lose nothing. 

The stock movements can be arbitrary and even adversarial. To balance out this pessimistic 
assumption, we assume that while making our predictions, we are allowed to watch the predictions of n 
“experts.” These experts could be arbitrarily correlated, and they may or may not know what they are 
talking about. The algorithm’s goal is to limit its cumulative losses (i. e., bad predictions) to roughly the 
same as the best of these experts. At first sight this seems an impossible goal, since it is not known until 
the end of the sequence who the best expert was, whereas the algorithm is required to make predictions 
all along. 

Indeed, the first algorithm one thinks of is to compute each day’s up/down prediction by going with 
the majority opinion among the experts that day. But, this algorithm doesn’t work because a majority of 
experts may be consistently wrong on every single day. 

The weighted majority algorithm corrects the trivial algorithm. It maintains a weighting of the experts. 
Initially all have equal weight. As time goes on, some experts are seen as making better predictions than 
others, and the algorithm increases their weight proportionately. The algorithm’s prediction of up/down 
for each day is computed by going with the opinion of the weighted majority of the experts for that day. 


Theorem 1.1. After T steps, let $m_i^{(T)}$ be the number of mistakes of expert i and $M^{(T)}$ be the number of 
mistakes our algorithm has made. Then we have the following bound for every i: 


$$M^{(T)} \le 2(1+\eta)\, m_i^{(T)} + \frac{2 \ln n}{\eta}.$$


In particular, this holds for i which is the best expert, i. e., having the least $m_i^{(T)}$. 
Weighted majority algorithm 
Initialization: Fix an $\eta \le 1/2$. With each expert i, associate the weight $w_i^{(1)} := 1$. 
For t = 1, 2, ..., T: 


1. Make the prediction that is the weighted majority of the experts’ predictions based on the 
weights $w_1^{(t)}, \ldots, w_n^{(t)}$. That is, predict “up” or “down” depending on which prediction has a 
higher total weight of experts advising it (breaking ties arbitrarily). 


2. For every expert i who predicts wrongly, decrease his weight for the next round by multiplying 
it by a factor of $(1 - \eta)$: 


$$w_i^{(t+1)} = (1-\eta)\, w_i^{(t)} \quad \text{(update rule)}. \tag{1.1}$$
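
As a concrete illustration, the box above can be transcribed almost line for line. The following is a minimal Python sketch, not the authors' code: the function name `weighted_majority`, the encoding of “up” as `True`, and the choice to break ties toward “up” are assumptions made here for the example.

```python
def weighted_majority(expert_predictions, outcomes, eta=0.5):
    """Deterministic weighted majority for binary events (True = "up").

    expert_predictions[t][i] is expert i's prediction in round t and
    outcomes[t] is the event that actually occurred.
    Returns (algorithm mistakes, per-expert mistakes, final weights).
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n                      # w_i^(1) := 1
    algo_mistakes = 0
    expert_mistakes = [0] * n
    for preds, outcome in zip(expert_predictions, outcomes):
        up = sum(w for w, p in zip(weights, preds) if p)
        down = sum(w for w, p in zip(weights, preds) if not p)
        guess = up >= down                   # ties broken toward "up" (an arbitrary choice)
        if guess != outcome:
            algo_mistakes += 1
        for i, p in enumerate(preds):
            if p != outcome:                 # expert i predicted wrongly
                expert_mistakes[i] += 1
                weights[i] *= (1.0 - eta)    # update rule (1.1)
    return algo_mistakes, expert_mistakes, weights


# Hypothetical data: expert 0 is always right, expert 1 always says "up",
# expert 2 is always wrong; the event alternates up/down for 20 days.
outcomes = [t % 2 == 0 for t in range(20)]
preds = [[o, True, not o] for o in outcomes]
mistakes, per_expert, _ = weighted_majority(preds, outcomes, eta=0.5)
print(mistakes, per_expert)
```

The per-expert mistake counts are tracked only so that the output can be compared with the bound of Theorem 1.1; the algorithm itself needs just the weights.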


Remark. When $m_i^{(T)} \gg (2/\eta) \ln n$ we see that the number of mistakes made by the algorithm is 
bounded from above by roughly $2(1+\eta)\, m_i^{(T)}$, i. e., approximately twice the number of mistakes made 
by the best expert. This is tight for any deterministic algorithm. However, the factor of 2 can be removed 
by substituting the above deterministic algorithm by a randomized algorithm that predicts according to 
the majority opinion with probability proportional to its weight. (In other words, if the total weight of 
the experts saying “up” is 3/4 then the algorithm predicts “up” with probability 3/4 and “down” with 
probability 1/4.) Then the number of mistakes after T steps is a random variable and the claimed upper 
bound holds for its expectation (see Section 2 for more details). 
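
The randomized variant can be examined without any sampling, because the weight updates do not depend on the algorithm's own predictions: the probability of a wrong prediction in a round is the fraction of the total weight held by the experts who are wrong in that round. The sketch below computes the expected number of mistakes this way. The function name and data layout are assumptions for illustration; the bound checked in the last line, $(1+\eta)\, m_i^{(T)} + (\ln n)/\eta$, is what Theorem 2.1 gives for 0/1 costs.

```python
import math


def expected_mistakes_randomized(expert_predictions, outcomes, eta=0.5):
    """Expected mistakes of the randomized weighted majority rule.

    In each round the algorithm predicts "up" with probability equal to the
    weight fraction of experts saying "up", so it errs with probability equal
    to the weight fraction of the experts who are wrong.
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n
    expected = 0.0
    for preds, outcome in zip(expert_predictions, outcomes):
        total = sum(weights)
        wrong = sum(w for w, p in zip(weights, preds) if p != outcome)
        expected += wrong / total
        weights = [w * (1.0 - eta) if p != outcome else w
                   for w, p in zip(weights, preds)]
    return expected


# Hypothetical data: expert 0 is always right, so the best expert has 0 mistakes.
outcomes = [t % 2 == 0 for t in range(20)]
preds = [[o, True, not o] for o in outcomes]
eta = 0.5
e = expected_mistakes_randomized(preds, outcomes, eta)
print(e, "<=", (1 + eta) * 0 + math.log(3) / eta)
```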


Proof. A simple induction shows that $w_i^{(t+1)} = (1-\eta)^{m_i^{(t)}}$. Let $\Phi^{(t)} = \sum_i w_i^{(t)}$ (“the potential function”). 
Thus $\Phi^{(1)} = n$. Each time we make a mistake, the weighted majority of experts also made a mistake, so at 
least half the total weight decreases by a factor $1 - \eta$. Thus, the potential function decreases by a factor 
of at least $(1 - \eta/2)$: 


$$\Phi^{(t+1)} \le \Phi^{(t)} \left( \frac{1}{2} + \frac{1}{2}(1-\eta) \right) = \Phi^{(t)} (1 - \eta/2).$$


Thus simple induction gives $\Phi^{(T+1)} \le n (1 - \eta/2)^{M^{(T)}}$. Finally, since $\Phi^{(T+1)} \ge w_i^{(T+1)}$ for all i, the 
claimed bound follows by comparing the above two expressions and using the fact that 


$$-\ln(1-\eta) \le \eta + \eta^2$$


since $\eta \le 1/2$. 
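
The final comparison is stated tersely. Written out as a worked step (our restatement, using only the two bounds on the potential from the proof together with $\ln(1-x) \le -x$), it reads:

```latex
\begin{align*}
(1-\eta)^{m_i^{(T)}} = w_i^{(T+1)} &\le \Phi^{(T+1)} \le n\,(1-\eta/2)^{M^{(T)}} \\
m_i^{(T)} \ln(1-\eta) &\le \ln n + M^{(T)} \ln(1-\eta/2) \le \ln n - \tfrac{\eta}{2}\, M^{(T)} \\
\tfrac{\eta}{2}\, M^{(T)} &\le \ln n + m_i^{(T)} \bigl(-\ln(1-\eta)\bigr) \le \ln n + (\eta + \eta^2)\, m_i^{(T)} \\
M^{(T)} &\le 2(1+\eta)\, m_i^{(T)} + \frac{2 \ln n}{\eta}
\end{align*}
```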


The beauty of this analysis is that it makes no assumption about the sequence of events: they 
could be arbitrarily correlated and could even depend upon our current weighting of the experts. In 
this sense, this algorithm delivers more than initially promised, and this lies at the root of why (after 
obvious generalization) it can give rise to the diverse algorithms mentioned earlier. In particular, the 
scenario where the events are chosen adversarially resembles a zero-sum game, which we consider later 
in Section 3.2. 
2 The Multiplicative Weights algorithm 


In the general setting, we have a set of n decisions and in each round, we are required to select one 
decision from the set. In each round, each decision incurs a certain cost, determined by nature or an 
adversary. All the costs are revealed after we choose our decision, and we incur the cost of the decision 
we chose. For example, in the prediction from expert advice problem, each decision corresponds to a 
choice of an expert, and the cost of an expert is 1 if the expert makes a mistake, and 0 otherwise. 

To motivate the Multiplicative Weights (MW) algorithm, consider the naive strategy that, in each 
iteration, simply picks a decision at random. The expected penalty will be that of the “average” decision. 
Suppose now that a few decisions are clearly better in the long run. This is easy to spot as the costs are 
revealed over time, and so it is sensible to reward them by increasing their probability of being picked in 
the next round (hence the multiplicative weight update rule). 

Intuitively, being in complete ignorance about the decisions at the outset, we select them uniformly 
at random. This maximum entropy starting rule reflects our ignorance. As we learn which ones are the 
good decisions and which ones are bad, we lower the entropy to reflect our increased knowledge. The 
multiplicative weight update is our means of skewing the distribution. 

We now set up some notation. Let t = 1, 2, ..., T denote the current round, and let i be a generic 
decision. In each round t, we select a distribution $p^{(t)}$ over the set of decisions, and select a decision i 
randomly from it. At this point, the costs of all the decisions are revealed by nature in the form of the 
vector $m^{(t)}$ such that decision i incurs cost $m_i^{(t)}$. We assume that the costs lie in the range $[-1, 1]$. This is 
the only assumption we make on the costs; nature is completely free to choose the cost vector as long 
as these bounds are respected, even with full knowledge of the distribution that we choose our decision 
from. 

The expected cost to the algorithm for sampling a decision i from the distribution $p^{(t)}$ is 


$$\mathbf{E}_{i \in p^{(t)}}\bigl[m_i^{(t)}\bigr] = m^{(t)} \cdot p^{(t)}.$$


The total expected cost over all rounds is therefore $\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)}$. Just as before, our goal is to design 
an algorithm which achieves a total expected cost not too much more than the cost of the best decision 
in hindsight, viz. $\min_i \sum_{t=1}^{T} m_i^{(t)}$. Consider the following algorithm, which we call the Multiplicative 
Weights Algorithm. This algorithm has been studied before as the prod algorithm of Cesa-Bianchi, 
Mansour, and Stoltz [17], and Theorem 2.1 can be seen to follow from Lemma 2 in [17]. 


The following theorem—completely analogous to Theorem 1.1—bounds the total expected cost of 
the Multiplicative Weights algorithm (given in Figure 1) in terms of the total cost of the best decision: 


Theorem 2.1. Assume that all costs $m_i^{(t)} \in [-1,1]$ and $\eta \le 1/2$. Then the Multiplicative Weights 
algorithm guarantees that after T rounds, for any decision i, we have 


$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \le \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{t=1}^{T} |m_i^{(t)}| + \frac{\ln n}{\eta}.$$

Multiplicative Weights algorithm 


Initialization: Fix an $\eta \le 1/2$. With each decision i, associate the weight $w_i^{(1)} := 1$. 
For t = 1, 2, ..., T: 


1. Choose decision i with probability proportional to its weight $w_i^{(t)}$. I. e., use the distribution 
over decisions $p^{(t)} = \{w_1^{(t)}/\Phi^{(t)}, \ldots, w_n^{(t)}/\Phi^{(t)}\}$ where $\Phi^{(t)} = \sum_i w_i^{(t)}$. 


2. Observe the costs of the decisions $m^{(t)}$. 


3. Penalize the costly decisions by updating their weights as follows: for every decision i, set 


$$w_i^{(t+1)} = w_i^{(t)} (1 - \eta m_i^{(t)}). \tag{2.1}$$


Figure 1: The Multiplicative Weights algorithm. 
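
Figure 1 translates directly into code. The following Python sketch is an illustration under stated assumptions rather than a reference implementation: the name `multiplicative_weights` is ours, the cost vectors are supplied up front as a list (an oblivious adversary, which is a special case of the setting above), and the random draw of a decision is omitted because only the expected cost $m^{(t)} \cdot p^{(t)}$ enters Theorem 2.1.

```python
import math
import random


def multiplicative_weights(cost_vectors, eta=0.5):
    """Run the MW update on a sequence of cost vectors with entries in [-1, 1].

    Returns (distributions, expected_cost), where distributions[t] is the
    distribution used in round t and expected_cost is the sum over rounds of
    the inner product of the cost vector with that distribution.
    """
    n = len(cost_vectors[0])
    weights = [1.0] * n                              # w_i^(1) := 1
    distributions = []
    expected_cost = 0.0
    for m in cost_vectors:
        phi = sum(weights)                           # potential Phi^(t)
        p = [w / phi for w in weights]               # step 1: p^(t)
        distributions.append(p)
        expected_cost += sum(mi * pi for mi, pi in zip(m, p))   # step 2: costs revealed
        weights = [w * (1.0 - eta * mi) for w, mi in zip(weights, m)]  # step 3: rule (2.1)
    return distributions, expected_cost


# Hypothetical usage: random costs, compared against the bound of Theorem 2.1
# for the best decision in hindsight.
rng = random.Random(0)
T, n, eta = 200, 5, 0.1
costs = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(T)]
_, total = multiplicative_weights(costs, eta)
best = min(range(n), key=lambda i: sum(m[i] for m in costs))
bound = (sum(m[best] for m in costs)
         + eta * sum(abs(m[best]) for m in costs)
         + math.log(n) / eta)
print(total <= bound)
```

Because $\eta \le 1/2$ and the costs lie in $[-1, 1]$, every update factor is at least $1/2$, so the weights stay positive and the normalization is always well defined.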


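Proof. The proof is along the lines of the earlier one, using the potential function $\Phi^{(t)} = \sum_i w_i^{(t)}$: 


$$\begin{aligned}
\Phi^{(t+1)} &= \sum_i w_i^{(t+1)} \\
&= \sum_i w_i^{(t)} (1 - \eta m_i^{(t)}) \\
&= \Phi^{(t)} - \eta \Phi^{(t)} \sum_i m_i^{(t)} p_i^{(t)} \\
&= \Phi^{(t)} (1 - \eta\, m^{(t)} \cdot p^{(t)}) \\
&\le \Phi^{(t)} \exp(-\eta\, m^{(t)} \cdot p^{(t)}).
\end{aligned}$$


Here, we used the fact that $p_i^{(t)} = w_i^{(t)} / \Phi^{(t)}$. Thus, by induction, after T rounds, we have 


$$\Phi^{(T+1)} \le \Phi^{(1)} \exp\Bigl(-\eta \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)}\Bigr) = n \cdot \exp\Bigl(-\eta \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)}\Bigr). \tag{2.2}$$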


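Next we use the following facts, which follow immediately from the convexity of the exponential 
function: 


$$\begin{aligned}
(1-\eta)^{x} &\le (1 - \eta x) && \text{if } x \in [0,1], \\
(1+\eta)^{-x} &\le (1 - \eta x) && \text{if } x \in [-1,0].
\end{aligned} \tag{2.3}$$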


Since $m_i^{(t)} \in [-1, 1]$, we have for every decision i, 


$$\Phi^{(T+1)} \ge w_i^{(T+1)} = \prod_{t \le T} (1 - \eta m_i^{(t)}) \ge (1-\eta)^{\sum_{\ge 0} m_i^{(t)}} \cdot (1+\eta)^{-\sum_{< 0} m_i^{(t)}}, \tag{2.4}$$


where the subscripts “$\ge 0$” and “$< 0$” in the summations refer to the rounds t where $m_i^{(t)}$ is $\ge 0$ and $< 0$ 
respectively. Taking logarithms in equations (2.2) and (2.4) we get: 


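$$\ln n - \eta \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \ge \sum_{\ge 0} m_i^{(t)} \ln(1-\eta) - \sum_{< 0} m_i^{(t)} \ln(1+\eta).$$


Negating, rearranging, and scaling by $1/\eta$: 


$$\begin{aligned}
\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} &\le \frac{\ln n}{\eta} + \frac{1}{\eta} \sum_{\ge 0} m_i^{(t)} \ln\frac{1}{1-\eta} + \frac{1}{\eta} \sum_{< 0} m_i^{(t)} \ln(1+\eta) \\
&\le \frac{\ln n}{\eta} + \frac{1}{\eta} \sum_{\ge 0} m_i^{(t)} (\eta + \eta^2) + \frac{1}{\eta} \sum_{< 0} m_i^{(t)} (\eta - \eta^2) \\
&= \frac{\ln n}{\eta} + \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{\ge 0} m_i^{(t)} - \eta \sum_{< 0} m_i^{(t)} \\
&= \frac{\ln n}{\eta} + \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{t=1}^{T} |m_i^{(t)}|.
\end{aligned}$$


In the second inequality we used the facts that 


$$\ln\Bigl(\frac{1}{1-\eta}\Bigr) \le \eta + \eta^2 \quad \text{and} \quad \ln(1+\eta) \ge \eta - \eta^2 \tag{2.5}$$


for $\eta \le 1/2$. 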
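
The restriction on $\eta$ enters the proof only through the elementary inequalities (2.3) and (2.5). A quick numerical sanity check is sketched below; the function name `facts_hold`, the grid resolution and the tolerance are our choices, and a grid check is evidence rather than a proof. The second line of (2.3) is tested in the equivalent form $(1+\eta)^{y} \le 1 + \eta y$ for $y = -x \in [0, 1]$.

```python
import math


def facts_hold(eta, grid=100, tol=1e-12):
    """Check (2.3) on a grid of x values and (2.5) for the given eta in (0, 1)."""
    xs = [k / grid for k in range(grid + 1)]          # x ranges over [0, 1]
    fact_23 = all(
        (1 - eta) ** x <= 1 - eta * x + tol           # first line of (2.3)
        and (1 + eta) ** x <= 1 + eta * x + tol       # second line of (2.3), with y = -x
        for x in xs
    )
    fact_25 = (
        math.log(1 / (1 - eta)) <= eta + eta ** 2 + tol
        and math.log(1 + eta) >= eta - eta ** 2 - tol
    )
    return fact_23 and fact_25


for eta in (0.01, 0.1, 0.25, 0.5, 0.9):
    print(eta, facts_hold(eta))
```

For $\eta = 0.9$ the first inequality in (2.5) fails, since $\ln 10 \approx 2.30$ exceeds $0.9 + 0.81$; this illustrates why some upper bound on $\eta$ is needed.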


Corollary 2.2. The Multiplicative Weights algorithm also guarantees that after T rounds, for any 
distribution p on the decisions, 


$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \le \sum_{t=1}^{T} (m^{(t)} + \eta |m^{(t)}|) \cdot p + \frac{\ln n}{\eta},$$


where $|m^{(t)}|$ is the vector obtained by taking the coordinate-wise absolute value of $m^{(t)}$. 


Proof. This corollary follows immediately from Theorem 2.1, by taking a convex combination of the 
inequalities for all decisions i with the distribution p. 
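
Spelled out, the convex combination reads as follows. This is a worked restatement in the notation of Theorem 2.1; it uses only $p_i \ge 0$ and $\sum_i p_i = 1$.

```latex
\begin{align*}
\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)}
  &= \sum_i p_i \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \\
  &\le \sum_i p_i \Bigl( \sum_{t=1}^{T} \bigl( m_i^{(t)} + \eta\, |m_i^{(t)}| \bigr) + \frac{\ln n}{\eta} \Bigr) \\
  &= \sum_{t=1}^{T} \bigl( m^{(t)} + \eta\, |m^{(t)}| \bigr) \cdot p + \frac{\ln n}{\eta}.
\end{align*}
```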


2.1 Updating with exponential factors: the Hedge algorithm 


In our description of the MW algorithm, the update rule uses multiplication by a linear function of the 
cost (specifically, $(1 - \eta m_i^{(t)})$ for expert i). In several other incarnations of the MW algorithm, notably the 
Hedge algorithm of Freund and Schapire [26], an exponential factor is used instead. This update rule is 
the following: 

$$w_i^{(t+1)} = w_i^{(t)} \cdot \exp(-\eta m_i^{(t)}). \tag{2.6}$$


As can be seen from the analysis of the MW algorithm, Hedge is not very different. The bound we obtain 
for Hedge is slightly different however. While most of the applications we present in the rest of the paper 
can be derived using Hedge as well with a little extra calculation, some applications, such as the ones in 
Sections 3.3 and 3.5, explicitly need the MW algorithm rather than Hedge to obtain better bounds. Here, 
we state the bound obtained for Hedge without proof; the analysis is on the same lines as before. The 
only difference is that instead of the inequalities (2.3), we use the inequality 


$$e^{-\eta x} \le 1 - \eta x + \eta^2 x^2 \quad \text{if } |\eta x| \le 1.$$


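Theorem 2.3. Assume that all costs $m_i^{(t)} \in [-1,1]$ and $\eta \le 1$. Then the Hedge algorithm guarantees 
that after T rounds, for any decision i, we have 


$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \le \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{t=1}^{T} (m^{(t)})^2 \cdot p^{(t)} + \frac{\ln n}{\eta}.$$


Here, $(m^{(t)})^2$ is the vector obtained by taking coordinate-wise square of $m^{(t)}$. 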


This guarantee is very similar to the one in Theorem 2.1, with one important difference: the term 
multiplying $\eta$ is a loss which depends on the algorithm’s distribution. In Theorem 2.1, this additional 
term depends on the loss of the best decision in hindsight. For some applications the latter guarantee is 
stronger (see Section 3.3). 
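
To make the comparison concrete, here is a small Python sketch of the exponential update. The function name `hedge` and the fixed list of cost vectors are assumptions for illustration. The check at the end assumes the Hedge guarantee has the form $\sum_t m^{(t)} \cdot p^{(t)} \le \sum_t m_i^{(t)} + \eta \sum_t (m^{(t)})^2 \cdot p^{(t)} + (\ln n)/\eta$, i. e., with the extra term depending on the algorithm's own distributions.

```python
import math
import random


def hedge(cost_vectors, eta=0.5):
    """Exponential-weights (Hedge) update on cost vectors with entries in [-1, 1].

    Returns (expected_cost, second_moment): the sums over rounds of the inner
    products of p^(t) with m^(t) and with the coordinate-wise square of m^(t).
    """
    n = len(cost_vectors[0])
    weights = [1.0] * n
    expected_cost = 0.0
    second_moment = 0.0
    for m in cost_vectors:
        phi = sum(weights)
        p = [w / phi for w in weights]
        expected_cost += sum(mi * pi for mi, pi in zip(m, p))
        second_moment += sum(mi * mi * pi for mi, pi in zip(m, p))
        # exponential factor in place of the linear factor (1 - eta * m_i)
        weights = [w * math.exp(-eta * mi) for w, mi in zip(weights, m)]
    return expected_cost, second_moment


# Hypothetical usage with random costs.
rng = random.Random(1)
T, n, eta = 200, 4, 0.3
costs = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(T)]
total, second = hedge(costs, eta)
best_total = min(sum(m[i] for m in costs) for i in range(n))
print(total <= best_total + eta * second + math.log(n) / eta)
```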


2.2 Proof via KL-divergence 


In this section, we give an alternative proof of Theorem 2.1 based on the Kullback-Leibler (KL) divergence, 
or relative entropy. While this proof is somewhat more complicated, it gives a good insight into why the 
MW algorithm works: the reason is that it tends to reduce the KL-divergence to the optimal solution. 
Another reason for giving this proof is that it yields a more nuanced form of the MW algorithm that is 
useful in some applications (such as the construction of hard-core sets, see Section 3.7). Readers may 
skip this section without loss of continuity. 

For two distributions p and q on the decision set, the relative entropy between them is 


$$\mathrm{RE}(p \,\|\, q) = \sum_i p_i \ln \frac{p_i}{q_i},$$


where the term $p_i \ln(p_i/q_i)$ is defined to be zero if $p_i = 0$ and infinite if $p_i \ne 0$, $q_i = 0$. 
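
The two conventions in this definition are easy to get wrong in code, so a short Python sketch may help. The function name `relative_entropy` and the list-based representation of distributions are assumptions made for the example.

```python
import math


def relative_entropy(p, q):
    """RE(p || q) = sum_i p_i ln(p_i / q_i) with the stated conventions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # term defined to be zero when p_i = 0
        if qi == 0:
            return math.inf     # term is infinite when p_i != 0 and q_i = 0
        total += pi * math.log(pi / qi)
    return total


uniform = [1 / 3] * 3
print(relative_entropy([1.0, 0.0, 0.0], uniform))   # a point mass versus uniform
print(relative_entropy(uniform, [1.0, 0.0, 0.0]))   # the reverse direction is infinite
```

Relative entropy is not symmetric. Against the uniform distribution on n decisions it equals $\ln n$ minus the entropy of p, so it never exceeds $\ln n$.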

Consider the following twist on the basic decision-making problem from Section 2. Fix a convex 
subset of distributions over decisions, $\mathcal{P}$ (note: the basic setting is recovered when $\mathcal{P}$ is the set of all 
distributions). In each round t, the decision-maker is required to produce a distribution $p^{(t)} \in \mathcal{P}$. At that 
point, the cost vector $m^{(t)}$ is revealed and the decision-maker suffers cost $m^{(t)} \cdot p^{(t)}$. Since we make the 
restriction that $p^{(t)} \in \mathcal{P}$, we now want to compare the total cost of the decision-maker to the cost of the 
best fixed distribution in $\mathcal{P}$. Consider the algorithm in Figure 2. 

Note that in the special case when $\mathcal{P}$ is the set of all distributions on the decisions, this algorithm is 
exactly the basic MW algorithm presented in Figure 1. The relative entropy projection step ensures that 
we always choose a distribution in $\mathcal{P}$. This projection is a convex program since relative entropy is convex 
and $\mathcal{P}$ is a convex set, and hence can be computed using standard convex programming techniques. 


Multiplicative Weights Update algorithm with Restricted Distributions 


Initialization: Fix an $\eta \le 1/2$. Set $p^{(1)}$ to be an arbitrary distribution in $\mathcal{P}$. 
For t = 1, 2, ..., T: 


1. Choose decision i by sampling from $p^{(t)}$. 

2. Observe the costs of the decisions $m^{(t)}$. 

3. Compute the probability vector $\hat{p}^{(t+1)}$ using the usual multiplicative update rule: for every 
expert i, 

$$\hat{p}_i^{(t+1)} = p_i^{(t)} (1 - \eta m_i^{(t)}) / \Phi^{(t)} \tag{2.7}$$

where $\Phi^{(t)}$ is the normalization factor to make $\hat{p}^{(t+1)}$ a distribution. 

4. Set $p^{(t+1)}$ to be the relative entropy projection of $\hat{p}^{(t+1)}$ on the set $\mathcal{P}$, i. e., 

$$p^{(t+1)} = \arg\min_{p \in \mathcal{P}} \mathrm{RE}(p \,\|\, \hat{p}^{(t+1)}).$$


Figure 2: The Multiplicative Weights algorithm with Restricted Distributions. 


THEORY OF COMPUTING, Volume 8 (2012), pp. 121-164 

We now prove a bound on the total cost of the algorithm (compare to Corollary 2.2). Note that in 
the basic setting when $\mathcal{P}$ is the set of all distributions, the bound given below is tighter than the one in 
Theorem 2.1. 


Theorem 2.4. Assume that all costs $m_i^{(t)} \in [-1,1]$ and $\eta \le 1/2$. Then the Multiplicative Weights 
algorithm with Restricted Distributions guarantees that after T rounds, for any $p \in \mathcal{P}$, we have 


$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \le \sum_{t=1}^{T} (m^{(t)} + \eta |m^{(t)}|) \cdot p + \frac{\mathrm{RE}(p \,\|\, p^{(1)})}{\eta},$$

where $|m^{(t)}|$ is the vector obtained by taking the coordinate-wise absolute value of $m^{(t)}$. 


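Proof. We use the relative entropy between p and $p^{(t)}$, $\mathrm{RE}(p \,\|\, p^{(t)}) := \sum_i p_i \ln(p_i / p_i^{(t)})$, as a “potential” 
function. We have 


$$\begin{aligned}
\mathrm{RE}(p \,\|\, \hat{p}^{(t+1)}) - \mathrm{RE}(p \,\|\, p^{(t)}) &= \sum_i p_i \ln \frac{p_i^{(t)}}{\hat{p}_i^{(t+1)}} \\
&= \sum_i p_i \ln \frac{\Phi^{(t)}}{1 - \eta m_i^{(t)}} \\
&\le \ln\Bigl(\frac{1}{1-\eta}\Bigr) \sum_{\ge 0} p_i m_i^{(t)} + \ln(1+\eta) \sum_{< 0} p_i m_i^{(t)} + \ln \Phi^{(t)} \\
&\le \eta (m^{(t)} + \eta |m^{(t)}|) \cdot p + \ln \Phi^{(t)}.
\end{aligned}$$


The first inequality above follows from (2.3), and the second follows from (2.5). Next, we have 


$$\ln \Phi^{(t)} = \ln\Bigl[\sum_i p_i^{(t)} (1 - \eta m_i^{(t)})\Bigr] = \ln\bigl[1 - \eta\, m^{(t)} \cdot p^{(t)}\bigr] \le -\eta\, m^{(t)} \cdot p^{(t)},$$


since $\ln(1 - x) \le -x$ for $x < 1$. Thus, we get 


$$\mathrm{RE}(p \,\|\, \hat{p}^{(t+1)}) - \mathrm{RE}(p \,\|\, p^{(t)}) \le \eta (m^{(t)} + \eta |m^{(t)}|) \cdot p - \eta (m^{(t)} \cdot p^{(t)}).$$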


This inequality essentially says that if the cost of the algorithm in round t, $m^{(t)} \cdot p^{(t)}$, is significantly larger 
than the cost of the comparator, $m^{(t)} \cdot p$, then $\hat{p}^{(t+1)}$ moves closer to p (in the relative entropy distance) 
than $p^{(t)}$. 

Now, projection on the set $\mathcal{P}$ using the relative entropy as a distance function is a Bregman projection, 
and thus it satisfies the following Generalized Pythagorean inequality (see, e. g., [39]), for any $p \in \mathcal{P}$: 


$$\mathrm{RE}(p \,\|\, p^{(t+1)}) + \mathrm{RE}(p^{(t+1)} \,\|\, \hat{p}^{(t+1)}) \le \mathrm{RE}(p \,\|\, \hat{p}^{(t+1)}).$$


I. e., the projection step only brings the distribution closer to p. Since relative entropy is always 
non-negative, we have $\mathrm{RE}(p^{(t+1)} \,\|\, \hat{p}^{(t+1)}) \ge 0$ and so 


$$\mathrm{RE}(p \,\|\, p^{(t+1)}) - \mathrm{RE}(p \,\|\, p^{(t)}) \le \eta (m^{(t)} + \eta |m^{(t)}|) \cdot p - \eta (m^{(t)} \cdot p^{(t)}).$$


Summing up from t = 1 to T, dividing by $\eta$, and simplifying using the fact that $\mathrm{RE}(p \,\|\, p^{(T+1)})$ is 
non-negative, we get the stated bound. 


2.3 Gains instead of losses 


There are situations where it makes more sense for the vector $m^{(t)}$ to specify gains for each expert rather 
than losses. Now our goal is to get as much total expected payoff as possible in comparison to the total 
payoff of the best expert. We can get an algorithm for this case simply by running the Multiplicative 
Weights algorithm using the cost vector $-m^{(t)}$. 

The resulting algorithm is identical, and the following theorem follows directly from Theorem 2.1 by 
simply negating the quantities: 
Theorem 2.5. Assume that all gains $m_i^{(t)} \in [-1,1]$ and $\eta \le 1/2$. Then the Multiplicative Weights algorithm 
(for gains) guarantees that after T rounds, for any expert i, we have 


$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \ge \sum_{t=1}^{T} m_i^{(t)} - \eta \sum_{t=1}^{T} |m_i^{(t)}| - \frac{\ln n}{\eta}.$$


We also have the following immediate corollary, corresponding to Corollary 2.2: 


Corollary 2.6. The Multiplicative Weights algorithm also guarantees that after T rounds, for any 
distribution p on the decisions, 


$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \ge \sum_{t=1}^{T} (m^{(t)} - \eta |m^{(t)}|) \cdot p - \frac{\ln n}{\eta},$$


where $|m^{(t)}|$ is the vector obtained by taking the coordinate-wise absolute value of $m^{(t)}$. 
3 Applications 


Typically, the Multiplicative Weights method is applied in the following manner. A prototypical example 
is to solve a constrained optimization problem. We then let a decision represent each constraint in the 
problem, with costs specified by the points in the domain of interest. For a given point, the cost of a 
decision is made proportional to how well the corresponding constraint is satisfied on the point. This 
might seem counterintuitive, but recall that we reduce a decision’s weight depending on its penalty, and if 
a constraint is well satisfied on points so far we would like its weight to be smaller, so that the algorithm 
focuses on constraints that are poorly satisfied. 

In many applications (though not all) the choice of point is also under our control. Typically we will 
need to generate the maximally adversarial point, i.e., the point that maximizes the expected cost. Then 
the overall algorithm consists of two subprocedures: an “oracle” for generating the maximally adversarial 
point at each step, and the MW algorithm for updating the weights of the decisions. With this intuition, 
we can describe the following applications. 


3.1 Learning a linear classifier: the Winnow algorithm 


To the best of our knowledge, the first time multiplicative weight updates were used was in the Winnow 
algorithm of Littlestone [50]. This is an algorithmic technique used in machine learning to learn linear 
classifiers. Equivalently, this can also be seen as solving a linear program. 

The setup is as follows. We are given m labeled examples, $(a_1, \ell_1), (a_2, \ell_2), \ldots, (a_m, \ell_m)$ where 
$a_j \in \mathbb{R}^n$ are feature vectors, and $\ell_j \in \{-1, 1\}$ are their labels. Our goal is to find non-negative weights 
such that for any example, the sign of the weighted combination of the features matches its label, i. e., 
find $x \in \mathbb{R}^n$ with $x_i \ge 0$ such that for all j = 1, 2, ..., m, we have $\mathrm{sgn}(a_j \cdot x) = \ell_j$. Equivalently, we require 
that $\ell_j a_j \cdot x \ge 0$ for all j. Without loss of generality, we may assume that the weights sum to 1 so that 
they form a distribution, i. e., $\mathbf{1} \cdot x = 1$, where $\mathbf{1}$ is the all 1’s vector. 

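Thus, for notational convenience, if we redefine $a_j$ to be $\ell_j a_j$, then the problem reduces to finding a 
solution to the following LP: 


$$\begin{aligned}
\forall j = 1, 2, \ldots, m: \quad a_j \cdot x &\ge 0, \\
\mathbf{1} \cdot x &= 1, \\
\forall i: \quad x_i &\ge 0.
\end{aligned} \tag{3.1}$$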


Note this is a quite general form of LP, and many commonly seen LPs can be reduced to this form. 

Now suppose there is a large-margin solution to this problem. I. e., there is an $\varepsilon > 0$ and a distribution 
$x^*$ so that for all j, we have $a_j \cdot x^* \ge \varepsilon$. We now give an algorithm based on MW to solve the LP above. 
Define $\rho = \max_j \|a_j\|_\infty$. 

We run the MW algorithm in the gain form (see Section 2.3) with $\eta = \varepsilon/(2\rho)$. The decisions are 
given by the n features, and gains are specified by the m examples. The gain for feature i for example j is 
$a_{ji}/\rho$. Note that these gains lie in the range $[-1, 1]$ as required. 

In each round t, let x be the distribution $p^{(t)}$ generated by the MW algorithm. Now, we look for a 
misclassified example, i. e., an example j such that $a_j \cdot x < 0$. If no such example exists, we are done 
and we can stop. Otherwise, if j is a misclassified example, then it specifies the gains for round t. Note 
that the gain in round t is $m^{(t)} \cdot p^{(t)} = (1/\rho)\, a_j \cdot x < 0$, whereas for the solution $x^*$, we have 


$$m^{(t)} \cdot x^* = \frac{1}{\rho}\, a_j \cdot x^* \ge \frac{\varepsilon}{\rho}.$$

We keep running the MW algorithm until we find a good solution (i. e., one that classifies all examples 
correctly). 
To get a bound on the number of iterations until we find a good solution, we apply Corollary 2.6 with 
$p = x^*$. Using the trivial bound $\eta |m^{(t)}| \cdot p \le \eta$, we have 


$$0 > \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \ge \sum_{t=1}^{T} (m^{(t)} - \eta |m^{(t)}|) \cdot p - \frac{\ln n}{\eta} \ge \frac{\varepsilon}{2\rho}\, T - \frac{2\rho \ln n}{\varepsilon},$$


which implies that $T < 4\rho^2 \ln(n)/\varepsilon^2$. Thus, in at most $\lceil 4\rho^2 \ln(n)/\varepsilon^2 \rceil$ iterations, we find a good solution. 
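
The procedure of this section can be sketched in a few lines of Python. This is a hedged illustration, not Littlestone's original formulation: the function name `winnow_mw`, the plain-list data layout, and the choice to scan the examples in order for the first misclassified one are assumptions. The examples are assumed to have their labels already folded in, and a margin $\varepsilon$ is assumed to be known.

```python
import math


def winnow_mw(examples, eps):
    """Find a distribution x with a_j . x >= 0 for every example a_j.

    Assumes the labels are folded into the examples and that some distribution
    x* satisfies a_j . x* >= eps for all j. Returns x, or None if the iteration
    budget derived from the analysis is exhausted (which signals that the
    margin assumption did not hold).
    """
    n = len(examples[0])
    rho = max(abs(v) for a in examples for v in a)   # rho = max_j ||a_j||_inf
    eta = eps / (2 * rho)
    budget = math.ceil(4 * rho ** 2 * math.log(n) / eps ** 2)
    weights = [1.0] * n
    for _ in range(budget + 1):
        total = sum(weights)
        x = [w / total for w in weights]
        bad = next((a for a in examples
                    if sum(ai * xi for ai, xi in zip(a, x)) < 0), None)
        if bad is None:
            return x                                  # every example is classified correctly
        # Gain of feature i is a_ji / rho; the gain-form update multiplies by (1 + eta * gain).
        weights = [w * (1.0 + eta * ai / rho) for w, ai in zip(weights, bad)]
    return None


# Hypothetical data: x* = (1, 0, 0) achieves margin 0.5 on all three examples.
examples = [[1.0, -1.0, -1.0], [0.5, 1.0, -1.0], [1.0, -0.5, 1.0]]
print(winnow_mw(examples, eps=0.5))
```

Since $a_j \cdot x^* \le \rho$ for any distribution $x^*$, the margin satisfies $\varepsilon \le \rho$, so the step size $\eta = \varepsilon/(2\rho)$ is at most $1/2$ as the gain-form guarantee requires.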


3.2 Solving zero-sum games approximately 


We show how our general algorithm above can be used to approximately solve zero-sum games. (This is a 
duplication of the results of Freund and Schapire [27], who gave the same algorithm but a different proof 
of convergence that used KL-divergence. Furthermore, convergence of simple algorithms to zero-sum
game equilibria was studied earlier in [34].)

Let $A$ be the payoff matrix of a finite 2-player zero-sum game, with $n$ rows (the number of columns
will play no role). When the row player plays strategy $i$ and the column player plays strategy $j$, then the
payoff to the column player is $A(i, j) := A_{ij}$. We assume that $A(i, j) \in [0, 1]$. If the row player chooses
his strategy $i$ from a distribution $p$ over the rows, then the expected payoff to the column player for
choosing a strategy $j$ is $A(p, j) := \mathbb{E}_{i \in p}[A(i, j)]$. Thus, the best response for the column player is the
strategy $j$ which maximizes this payoff. Similarly, if the column player chooses his strategy $j$ from a
distribution $q$ over the columns, then the expected payoff he gets if the row player chooses the strategy $i$ is
$A(i, q) := \mathbb{E}_{j \in q}[A(i, j)]$. Thus, the best response for the row player is the strategy $i$ which minimizes this
payoff. John von Neumann's min-max theorem says that if each of the players chooses a distribution over
their strategies to optimize their worst case payoff (or payout), then the value they obtain is the same:

$$ \min_p \max_j A(p, j) = \max_q \min_i A(i, q), \tag{3.2} $$

where $p$ (resp., $q$) varies over all distributions over rows (resp., columns). Also, $i$ (resp., $j$) varies over all
rows (resp., columns). The common value of these two quantities, denoted $\lambda^*$, is known as the value of
the game.

Let $\varepsilon > 0$ be an error parameter. We wish to approximately solve the zero-sum game up to additive
error of $\varepsilon$, namely, find mixed row and column strategies $\tilde{p}$ and $\tilde{q}$ such that

$$ \lambda^* - \varepsilon \le \min_i A(i, \tilde{q}), \tag{3.3} $$

$$ \max_j A(\tilde{p}, j) \le \lambda^* + \varepsilon. \tag{3.4} $$

The algorithmic assumption about the game is that given any distribution $p$ on decisions, we have an
efficient way to pick the best response, namely, the pure column strategy $j$ that maximizes $A(p, j)$. This
quantity is at least $\lambda^*$ (the value of the game) from the definition above. Call this algorithm the ORACLE.


Theorem 3.1. Given an error parameter $\varepsilon > 0$, there is an algorithm which solves the zero-sum game up
to an additive factor of $\varepsilon$ using $O(\log(n)/\varepsilon^2)$ calls to ORACLE, with an additional processing time of
$O(n)$ per call.


Proof. We map our general algorithm from Section 2 to this setting by considering (3.3) as specifying $n$
linear constraints on the probability vector $\tilde{q}$: viz., for all rows $i$, $A(i, \tilde{q}) \ge \lambda^* - \varepsilon$, where $\lambda^*$ is the value of the game. Now, following the
intuition given in the beginning of this section, we make our decisions correspond to pure strategies
of the row player. Thus a distribution on the decisions corresponds to a mixed row strategy. Costs of
the decisions are specified by pure strategies of the column player. The cost paid by a decision $i$ when the
column player chooses strategy $j$ is $A(i, j)$.

In each round, given a distribution $p^{(t)}$ on the rows, we will choose the column $j^{(t)}$ to be the best
response strategy to $p^{(t)}$ for the column player, by calling ORACLE. Thus, the cost vector $m^{(t)}$ is the
$j^{(t)}$-th column of the matrix $A$.

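Since all $A(i, j) \in [0, 1]$, we can apply Corollary 2.2 to get that after $T$ rounds, for any distribution on
the rows $p$, we have

$$ \sum_{t=1}^T A(p^{(t)}, j^{(t)}) \le (1 + \eta) \sum_{t=1}^T A(p, j^{(t)}) + \frac{\ln n}{\eta}. $$

Dividing by $T$, and using the fact that $A(p, j^{(t)}) \le 1$ and that for all $t$, $A(p^{(t)}, j^{(t)}) \ge \lambda^*$ (the value of the game), we get

$$ \lambda^* \le \frac{1}{T} \sum_{t=1}^T A(p^{(t)}, j^{(t)}) \le \frac{1}{T} \sum_{t=1}^T A(p, j^{(t)}) + \eta + \frac{\ln n}{\eta T}. $$

Setting $p = p^*$, the optimal row strategy, we have $A(p, j) \le \lambda^*$ for any $j$. By setting $\eta = \varepsilon/2$ and
$T = \lceil 4\ln(n)/\varepsilon^2 \rceil$, we get that

$$ \lambda^* \le \frac{1}{T} \sum_{t=1}^T A(p^{(t)}, j^{(t)}) \le \frac{1}{T} \sum_{t=1}^T A(p, j^{(t)}) + \varepsilon \le \lambda^* + \varepsilon. \tag{3.5} $$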


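Thus, $\frac{1}{T} \sum_{t=1}^T A(p^{(t)}, j^{(t)})$ is an (additive) $\varepsilon$-approximation to $\lambda^*$.
Let $\tilde{t}$ be the round $t$ with the minimum value of $A(p^{(t)}, j^{(t)})$. We have, from (3.5),

$$ A(p^{(\tilde{t})}, j^{(\tilde{t})}) \le \frac{1}{T} \sum_{t=1}^T A(p^{(t)}, j^{(t)}) \le \lambda^* + \varepsilon. $$

Since $j^{(\tilde{t})}$ maximizes $A(p^{(\tilde{t})}, j)$ over all $j$, we conclude that $p^{(\tilde{t})}$ is an approximately optimal mixed row
strategy, and thus we can set $\tilde{p} := p^{(\tilde{t})}$.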


Alternatively, we can set $\tilde{p} = (1/T) \sum_t p^{(t)}$. For let $j^*$ be the optimal column player response to $\tilde{p}$. Then we have

$$ A(\tilde{p}, j^*) = \frac{1}{T} \sum_{t=1}^T A(p^{(t)}, j^*) \le \frac{1}{T} \sum_{t=1}^T A(p^{(t)}, j^{(t)}) \le \lambda^* + \varepsilon. $$

We set $\tilde{q}$ to be the distribution which assigns to column $j$ the probability

$$ \frac{|\{t : j^{(t)} = j\}|}{T}. $$

From (3.5), for any row strategy $i$, by setting $p$ to be concentrated on the pure strategy $i$, we have

$$ \lambda^* - \varepsilon \le \frac{1}{T} \sum_{t=1}^T A(i, j^{(t)}) = A(i, \tilde{q}), $$

which shows that $\tilde{q}$ is an approximately optimal mixed column strategy.
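The proof is constructive, and the following Python sketch mirrors it step by step. It is a minimal illustration under assumptions: the name `solve_zero_sum` and the dense list-of-lists payoff matrix are hypothetical, the ORACLE is implemented by brute force over columns, and the matrix needs at least two rows so that $T \ge 1$.

```python
import math

def solve_zero_sum(A, eps):
    """Approximately solve a zero-sum game with payoffs in [0, 1].

    The row player minimizes and the column player maximizes. Returns (p, q),
    intended to satisfy (3.4) and (3.3) respectively.
    """
    n, k = len(A), len(A[0])
    eta = eps / 2.0
    T = math.ceil(4 * math.log(n) / eps ** 2)
    w = [1.0] * n                      # one weight per pure row strategy
    counts = [0] * k                   # how often each column was the best response
    best_p, best_val = None, math.inf
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]
        payoff = [sum(p[i] * A[i][j] for i in range(n)) for j in range(k)]
        j = max(range(k), key=lambda c: payoff[c])   # ORACLE: best response to p
        counts[j] += 1
        if payoff[j] < best_val:       # remember the round with the smallest A(p, j)
            best_p, best_val = p, payoff[j]
        # Cost vector is column j of A; multiplicative update in cost form.
        w = [w[i] * (1.0 - eta * A[i][j]) for i in range(n)]
    q = [c / T for c in counts]        # empirical distribution of best responses
    return best_p, q
```

On the $2 \times 2$ identity matrix (a matching-pennies-like game of value $1/2$), calling `solve_zero_sum(A, 0.1)` runs $\lceil 4\ln(2)/0.01 \rceil = 278$ rounds.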


3.3 Plotkin, Shmoys, Tardos framework for packing/covering LPs 


Plotkin, Shmoys, and Tardos [56] generalized some known flow algorithms to a framework for approximately
solving fractional packing and covering problems, which are a special case of linear programming
formally defined below. Their algorithm is a quantitative version of the classical Lagrangean relaxation 
idea, and applies also to general linear programs. Below, we derive the algorithm for general LPs and then 
mention the slight modification that yields better running time for packing-covering LPs. Also, we note 
that we could derive this algorithm as a special case of game solving, but for concreteness we describe it 
explicitly. 
The basic problem is to check the feasibility of the following convex program:

$$ \exists?\, x \in P:\ Ax \ge b, \tag{3.6} $$

where $A \in \mathbb{R}^{m \times n}$ is an $m \times n$ matrix, $x \in \mathbb{R}^n$, and $P$ is a convex set in $\mathbb{R}^n$. Intuitively, the set $P$ represents
the "easy" constraints to satisfy, such as non-negativity, and $A$ represents the "hard" constraints to satisfy.
We wish to design an algorithm that, given an error parameter $\varepsilon > 0$, either solves the problem to
an additive error of $\varepsilon$, i.e., finds an $x \in P$ such that for all $i$, $A_i x \ge b_i - \varepsilon$, or failing that, proves that the
system is infeasible. Here, $A_i$ is the $i$th row of $A$.
We assume the existence of an algorithm, called ORACLE, which, given a probability vector $p$ on the
$m$ constraints, solves the following feasibility problem:

$$ \exists?\, x \in P:\ p^\top A x \ge p^\top b. \tag{3.7} $$


One way to implement this procedure is by maximizing $p^\top A x$ over $x \in P$. It is reasonable to expect such
an optimization procedure to exist (indeed, such is the case for many applications) since we only need to
check the feasibility of one constraint rather than $m$. If the feasibility problem (3.6) has a solution $x^*$,
then the same solution also satisfies (3.7) for any probability vector $p$ over the constraints. Thus, if there
is a probability vector $p$ over the constraints such that no $x \in P$ satisfies (3.7), then it is proof that the
original problem is infeasible.

We assume that the ORACLE satisfies the following technical condition, which is necessary for
deriving running time bounds.

Definition 3.2. An $(\ell, \rho)$-bounded ORACLE, for parameters $0 \le \ell \le \rho$, is an algorithm which, given a
probability vector $p$ over the constraints, solves the feasibility problem (3.7). Furthermore, there is a fixed
subset $I \subseteq [m]$ of constraints such that whenever the ORACLE manages to find a point $x \in P$ satisfying
(3.7), the following holds:

$$
\begin{aligned}
\forall i \in I:\quad & A_i x - b_i \in [-\ell, \rho], \\
\forall i \notin I:\quad & A_i x - b_i \in [-\rho, \ell].
\end{aligned}
$$

The value $\rho$ is called the width of the problem.


In previous work, such as [56], only $(\rho, \rho)$-bounded ORACLEs are considered. We separate out the
upper and lower bounds in order to obtain tighter guarantees on the running time. The results of [56] can
be recovered simply by setting $\ell = \rho$.


Theorem 3.3. Let $\varepsilon > 0$ be a given error parameter. Suppose there exists an $(\ell, \rho)$-bounded ORACLE
for the feasibility problem (3.7). Assume that $\ell \ge \varepsilon/2$. Then there is an algorithm which either finds an $x$
such that $\forall i:\ A_i x \ge b_i - \varepsilon$, or correctly concludes that the system is infeasible. The algorithm makes only
$O(\ell\rho \log(m)/\varepsilon^2)$ calls to the ORACLE, with an additional processing time of $O(m)$ per call.


Proof. The condition $\ell \ge \varepsilon/2$ is only technical, and if it is not met we can just redefine $\ell$ to be $\varepsilon/2$. To
map our general framework to this situation, we have a decision representing each of the $m$ constraints.
Costs are determined by points $x \in P$. The cost of constraint $i$ for point $x$ is $(1/\rho)[A_i x - b_i]$ (so that the
costs lie in the range $[-1, 1]$).

In each round $t$, given a distribution $p^{(t)}$ over the decisions (i.e., the constraints), we run the ORACLE
with $p^{(t)}$. If the ORACLE declares that there is no $x \in P$ such that ${p^{(t)}}^\top A x \ge {p^{(t)}}^\top b$, then we stop, because
now $p^{(t)}$ is proof that the problem (3.6) is infeasible.

So let us assume that this doesn't happen, i.e., in all rounds $t$, the ORACLE manages to find a solution
$x^{(t)}$ such that ${p^{(t)}}^\top A x^{(t)} \ge {p^{(t)}}^\top b$. Since the cost vector to the Multiplicative Weights algorithm is specified to
be $m^{(t)} := (1/\rho)[A x^{(t)} - b]$, we conclude that the expected cost in each round is non-negative:

$$ m^{(t)} \cdot p^{(t)} = \frac{1}{\rho}\, [A x^{(t)} - b] \cdot p^{(t)} = \frac{1}{\rho}\, \left[{p^{(t)}}^\top A x^{(t)} - {p^{(t)}}^\top b\right] \ge 0. $$

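Let $i \in I$. Then Theorem 2.1 tells us that after $T$ rounds,

$$
\begin{aligned}
0 &\le \sum_{t=1}^T \frac{1}{\rho}\,[A_i x^{(t)} - b_i] + \eta \sum_{t=1}^T \frac{1}{\rho}\,|A_i x^{(t)} - b_i| + \frac{\ln m}{\eta} \\
&= (1 + \eta) \sum_{t=1}^T \frac{1}{\rho}\,[A_i x^{(t)} - b_i] + 2\eta \sum_{<0} \frac{1}{\rho}\,|A_i x^{(t)} - b_i| + \frac{\ln m}{\eta} \\
&\le (1 + \eta) \sum_{t=1}^T \frac{1}{\rho}\,[A_i x^{(t)} - b_i] + \frac{2\eta\ell}{\rho}\, T + \frac{\ln m}{\eta}.
\end{aligned}
$$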

Here, the subscript "$< 0$" refers to the rounds $t$ when $A_i x^{(t)} - b_i < 0$. The last inequality follows because
if $A_i x^{(t)} - b_i < 0$, then $|A_i x^{(t)} - b_i| \le \ell$. Dividing by $T$, multiplying by $\rho$, and letting $\bar{x} = (1/T) \sum_{t=1}^T x^{(t)}$
(note that $\bar{x} \in P$ since $P$ is a convex set), we get that

$$ 0 \le (1 + \eta)[A_i \bar{x} - b_i] + 2\eta\ell + \frac{\rho \ln(m)}{\eta T}. $$

Now, if we choose $\eta = \varepsilon/(4\ell)$ (note that $\eta \le 1/2$ since $\ell \ge \varepsilon/2$), and $T = \lceil 8\ell\rho \ln(m)/\varepsilon^2 \rceil$, we get that

$$ 0 \le (1 + \eta)[A_i \bar{x} - b_i] + \varepsilon \implies A_i \bar{x} \ge b_i - \varepsilon. $$

Reasoning similarly for $i \notin I$, we get the same inequality. Putting both together, we conclude that $\bar{x}$
satisfies the feasibility problem (3.6) up to an additive $\varepsilon$ factor, as desired.
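To make the proof concrete, here is a minimal Python sketch for one special case. Everything specific in it is an assumption for illustration rather than part of the framework: the convex set $P$ is taken to be the probability simplex, the ORACLE maximizes $p^\top A x$ by scanning the simplex vertices, the bounds are set to $\ell = \rho$ (the setting of [56]), and the name `pst_feasibility` is hypothetical.

```python
import math

def pst_feasibility(A, b, eps):
    """Check {x in simplex : A x >= b} up to additive eps.

    Returns an averaged point x, or None when the ORACLE certifies infeasibility.
    Assumes at least two constraints, so that ln(m) > 0.
    """
    m, n = len(A), len(A[0])
    # Width: |A_i x - b_i| <= rho on the simplex; also enforce rho >= eps/2.
    rho = max(max(abs(A[i][j] - b[i]) for i in range(m) for j in range(n)), eps / 2.0)
    ell = rho
    eta = eps / (4.0 * ell)
    T = math.ceil(8.0 * ell * rho * math.log(m) / eps ** 2)
    w = [1.0] * m                       # one weight per constraint
    x_avg = [0.0] * n
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]
        # ORACLE: maximize p^T A x over the simplex, attained at a vertex e_j.
        scores = [sum(p[i] * A[i][j] for i in range(m)) for j in range(n)]
        j = max(range(n), key=lambda c: scores[c])
        if scores[j] < sum(p[i] * b[i] for i in range(m)) - 1e-12:
            return None                 # p is a proof that (3.6) is infeasible
        x_avg[j] += 1.0 / T             # accumulate the average of the x^(t)
        # Cost of constraint i is (A_i x - b_i) / rho with x = e_j.
        w = [w[i] * (1.0 - eta * (A[i][j] - b[i]) / rho) for i in range(m)]
    return x_avg
```

Satisfied constraints have positive cost and lose weight, so later ORACLE calls are steered toward the constraints that are still violated on average.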


3.3.1 Concave constraints 


The algorithm of Section 3.3 works not just for linear constraints over a convex domain, but also for 
concave constraints. Imagine that we have the following feasibility problem: 


$$ \exists?\, x \in P:\ \forall i \in [m]:\ f_i(x) \ge 0, \tag{3.8} $$

where, as before, $P \subseteq \mathbb{R}^n$ is a convex domain, and for $i \in [m]$, $f_i : P \to \mathbb{R}$ are concave functions. We wish
to satisfy this system approximately, up to an additive error of $\varepsilon$. Again, we assume the existence of
an ORACLE, which, when given a probability distribution $p = (p_1, p_2, \ldots, p_m)^\top$, solves the following
feasibility problem:

$$ \exists?\, x \in P:\ \sum_i p_i f_i(x) \ge 0. \tag{3.9} $$

An ORACLE would be called $(\ell, \rho)$-bounded if there is a fixed subset of constraints $I \subseteq [m]$ such that
whenever it returns a feasible solution $x$ to (3.9), all constraints $i \in I$ take values in the range $[-\ell, \rho]$ on
the point $x$, and all the rest take values in $[-\rho, \ell]$.


Theorem 3.4. Let $\varepsilon > 0$ be a given error parameter. Suppose there exists an $(\ell, \rho)$-bounded ORACLE
for the feasibility problem (3.8). Assume that $\ell \ge \varepsilon/2$. Then there is an algorithm which either solves the
problem up to an additive error of $\varepsilon$, or correctly concludes that the system is infeasible, making only
$O(\ell\rho \log(m)/\varepsilon^2)$ calls to the ORACLE, with an additional processing time of $O(m)$ per call.


Proof. Just as before, we have a decision for every constraint, and costs are specified by points $x \in P$.
The cost of constraint $i$ for point $x$ is $(1/\rho) f_i(x)$.

Now we run the Multiplicative Weights algorithm with this setup. Again, if at any point the ORACLE
declares that (3.9) is infeasible, we immediately halt and declare the system (3.8) infeasible. So assume
this never happens. Then as before, the expected cost in each round is $m^{(t)} \cdot p^{(t)} \ge 0$. Now, applying
Theorem 2.1 as before, we conclude that for any $i \in I$, we have

$$ 0 \le (1 + \eta) \sum_{t=1}^T \frac{1}{\rho} f_i(x^{(t)}) + \frac{2\eta\ell}{\rho}\, T + \frac{\ln m}{\eta}. $$

Dividing by $T$, multiplying by $\rho$, and letting $\bar{x} = (1/T) \sum_{t=1}^T x^{(t)}$ (note that $\bar{x} \in P$ since $P$ is a convex set),
we get that

$$ 0 \le (1 + \eta) f_i(\bar{x}) + 2\eta\ell + \frac{\rho \ln(m)}{\eta T}, $$

since

$$ \frac{1}{T} \sum_{t=1}^T f_i(x^{(t)}) \le f_i\!\left(\frac{1}{T} \sum_{t=1}^T x^{(t)}\right), $$

by Jensen's inequality, since all the $f_i$ are concave.
Now, if we choose $\eta = \varepsilon/(4\ell)$ (note that $\eta \le 1/2$ since $\ell \ge \varepsilon/2$), and $T = \lceil 8\ell\rho \ln(m)/\varepsilon^2 \rceil$, we get that

$$ 0 \le (1 + \eta) f_i(\bar{x}) + \varepsilon \implies f_i(\bar{x}) \ge -\varepsilon. $$

Reasoning similarly for $i \notin I$, we get the same inequality. Putting both together, we conclude that $\bar{x}$
satisfies the feasibility problem (3.8) up to an additive $\varepsilon$ factor, as desired.


3.3.2 Approximate ORACLEs 


The algorithm described in the previous section allows some slack for the implementation of the ORACLE. 
This slack is very useful in designing efficient implementations for the ORACLE. 

Define an $\varepsilon$-approximate ORACLE for the feasibility problem (3.6) to be one that solves the feasibility
problem (3.7) up to an additive error of $\varepsilon$. That is, given a probability vector $p$ on the constraints, either it
finds an $x \in P$ such that $p^\top A x \ge p^\top b - \varepsilon$, or it declares correctly that (3.7) is infeasible.


Theorem 3.5. Let $\varepsilon > 0$ be a given error parameter. Suppose there exists an $(\ell, \rho)$-bounded $\varepsilon/3$-approximate
ORACLE for the feasibility problem (3.6). Assume that $\ell \ge \varepsilon/3$. Then there is an algorithm
which either solves the problem up to an additive error of $\varepsilon$, or correctly concludes that the system is
infeasible, making only $O(\ell\rho \log(m)/\varepsilon^2)$ calls to the ORACLE, with an additional processing time of
$O(m)$ per call.


Proof. We run the algorithm of the previous section with the given ORACLE, setting $\eta = \varepsilon/(6\ell)$. Now, in
every round, the expected cost is at least $-\varepsilon/(3\rho)$. Simplifying as before, we get that after $T$ rounds,
the average point $\bar{x} = (1/T) \sum_{t=1}^T x^{(t)}$ returned by the ORACLE satisfies

$$ -\frac{\varepsilon}{3} \le (1 + \eta)[A_i \bar{x} - b_i] + 2\eta\ell + \frac{\rho \ln(m)}{\eta T}. $$

Now, if $T = \lceil 18\ell\rho \ln(m)/\varepsilon^2 \rceil$, then we get that for all $i$, $A_i \bar{x} \ge b_i - \varepsilon$, as required.
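As a quick check of the constants (a worked step added for the reader, using only the parameter choices stated in the proof), the bound obtained after $T$ rounds reads $-\varepsilon/3 \le (1+\eta)[A_i \bar{x} - b_i] + 2\eta\ell + \rho\ln(m)/(\eta T)$. Substituting $\eta = \varepsilon/(6\ell)$ and $T \ge 18\ell\rho\ln(m)/\varepsilon^2$ gives

$$
2\eta\ell = \frac{\varepsilon}{3}, \qquad
\frac{\rho \ln m}{\eta T} = \frac{6\ell\rho \ln m}{\varepsilon T} \le \frac{\varepsilon}{3},
\qquad\text{so}\qquad
(1 + \eta)[A_i \bar{x} - b_i] \ge -\frac{\varepsilon}{3} - \frac{\varepsilon}{3} - \frac{\varepsilon}{3} = -\varepsilon .
$$

Since $1 + \eta \ge 1$, dividing through yields $A_i \bar{x} - b_i \ge -\varepsilon$. The three $\varepsilon/3$ terms come, respectively, from the ORACLE's approximation error, the $2\eta\ell$ term, and the $\rho\ln(m)/(\eta T)$ term.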


3.3.3 Fractional Covering Problems 


In fractional covering problems, the framework is the same as above, with the crucial difference that the
coefficient matrix $A$ is such that $Ax \ge 0$ for all $x \in P$, and $b > 0$. An $\varepsilon$-approximate solution to this
system is an $x \in P$ such that $Ax \ge (1 - \varepsilon) b$.

We assume without loss of generality (by appropriately scaling the inequalities) that $b_i = 1$ for all
rows, so that now we desire to find an $x \in P$ which satisfies the system within an additive $\varepsilon$ factor. Since
for all $x \in P$, we have $Ax \ge 0$, and since all $b_i = 1$, we conclude that for any $i$, $A_i x - b_i \ge -1$. Thus, we
assume that there is a $(1, \rho)$-bounded ORACLE for this problem. Now, applying Theorem 3.3, we get the
following theorem.

Theorem 3.6. Suppose there exists a $(1, \rho)$-bounded ORACLE for the program $Ax \ge \mathbf{1}$ with $x \in P$. Given
an error parameter $\varepsilon > 0$, there is an algorithm which computes an $\varepsilon$-approximate solution to the program,
or correctly concludes that it is infeasible, using $O(\rho \log(m)/\varepsilon^2)$ calls to the ORACLE, plus an additional
processing time of $O(m)$ per call.


3.3.4 Fractional Packing Problems 


A fractional packing problem can be written as 


$$ \exists?\, x \in P:\ Ax \le b $$

where $P$ is a convex domain such that $Ax \ge 0$ for all $x \in P$, and $b > 0$. An $\varepsilon$-approximate solution to this
system is an $x \in P$ such that $Ax \le (1 + \varepsilon) b$.

Again, we assume that $b_i = 1$ for all $i$, scaling the constraints if necessary. Now by rewriting this
system as

$$ \exists?\, x \in P:\ -Ax \ge -b $$

we cast it in our general framework, and a solution $x \in P$ which satisfies this up to an additive $\varepsilon$ is an
$\varepsilon$-approximate solution to the original system. Since for all $x \in P$, we have $Ax \ge 0$, and since all $b_i = 1$,
we conclude that for any $i$, $-A_i x + b_i \le 1$. Thus, we assume that there is a $(1, \rho)$-bounded ORACLE for
this problem. Now, applying Theorem 3.3, we get the following:


Theorem 3.7. Suppose there exists a $(1, \rho)$-bounded ORACLE for the program $-Ax \ge -\mathbf{1}$ with $x \in P$.
Given an error parameter $\varepsilon > 0$, there is an algorithm which computes an $\varepsilon$-approximate solution to the
program, or correctly concludes that it is infeasible, using $O(\rho \log(m)/\varepsilon^2)$ calls to the ORACLE, plus an
additional processing time of $O(m)$ per call.


3.4 Approximating multicommodity flow problems 


Multicommodity flow problems are represented by packing/covering LPs and thus can be approximately 
solved using the PST framework outlined above. The resulting flow algorithm is outlined below together 
with a brief analysis. Unfortunately, the algorithm is not polynomial-time because its running time is 
bounded by a polynomial function of the edge capacities (as opposed to the logarithm of the capacities, 
which is the number of bits needed to represent them). Garg and Könemann [29, 30] fixed this problem 
with a better algorithm whose running time does not depend upon the edge capacities. 

Here we derive the Garg-Könemann algorithm using our general framework. This will highlight the
essential new idea, namely, a reweighting of penalties to reduce the width parameter. Note that the algorithm
is not quite the same as in [29, 30] (the termination condition is slightly different) and neither is the proof;
the running time bound is the same, however.

For illustrative purposes we focus on the maximum multicommodity flow problem. In this problem,
we are given a graph $G = (V, E)$ with capacities $c_e$ on edges, and a set of $k$ source-sink pairs of nodes.
Let $\mathcal{P}$ be the set of all paths between the source-sink pairs. The objective is to maximize the total flow
between these pairs. The LP formulation is as follows:

$$
\begin{aligned}
\max\ & \sum_{p \in \mathcal{P}} f_p \\
\forall e \in E:\quad & \sum_{p \ni e} f_p \le c_e, \\
\forall p \in \mathcal{P}:\quad & f_p \ge 0.
\end{aligned}
\tag{3.10}
$$

Here, the variable $f_p$ represents the flow on path $p$.

Before presenting the Garg-Könemann idea we first present the algorithm one would obtain by
applying our packing-covering framework (Section 3.3) in the obvious way.

First, note that by using binary search we can reduce the optimization problem to feasibility, by
iteratively introducing a new constraint that gives a lower bound on the objective. So assume without
loss of generality that we know the value $F^{\mathrm{opt}}$ of the total flow in the optimum solution. Then we want to
solve the following feasibility problem:

$$ \exists?\, f \in P:\ \forall e \in E:\ \sum_{p \ni e} f_p \le c_e, $$

where

$$ P := \Big\{ f :\ \forall p \in \mathcal{P}:\ f_p \ge 0,\ \sum_{p \in \mathcal{P}} f_p = F^{\mathrm{opt}} \Big\}, $$

and $\mathcal{P}$ denotes the set of all paths between the source-sink pairs.
In this form, the feasibility problem given above is a packing LP; thus, we can apply the Multiplicative
Weights algorithm of Section 3.3.4.
As outlined in Section 3.3, the obvious algorithm would maintain at each step $t$ a weight $w_e^{(t)}$ for
each edge $e$. The ORACLE can be implemented by finding the flow in $P$ which minimizes

$$ \sum_e w_e^{(t)} \sum_{p \ni e} \frac{f_p}{c_e} = \sum_p f_p \sum_{e \in p} \frac{w_e^{(t)}}{c_e}. $$

The optimal flow is supported on a single path, namely, the path $p^{(t)}$ between a source-sink pair that has minimum length, when
every edge $e \in E$ is given length $w_e^{(t)}/c_e$. Thus in every round we find this path $p^{(t)}$ and pass a flow $F^{\mathrm{opt}}$
on this path. Note that the final flow will be an average of the flows in each round, and hence will also
have value $F^{\mathrm{opt}}$. Costs for the edges are defined as in Section 3.3.

Unfortunately the width parameter is 


$$ \rho = \max_{f \in P} \max_e \frac{\sum_{p \ni e} f_p}{c_e} = \frac{F^{\mathrm{opt}}}{c_{\min}}, $$


where $c_{\min}$ is the capacity of the minimum capacity edge in the graph. The algorithm requires $T =
\rho \ln(n)/\varepsilon^2$ iterations to get a $(1 - \varepsilon)$-approximation to the optimal flow. The overall running time is
$\tilde{O}(F^{\mathrm{opt}} T_{\mathrm{sp}}/c_{\min})$ where $T_{\mathrm{sp}} = O(mk)$ is the time needed to compute $k$ shortest paths. As already mentioned,
this is not polynomial-time since it depends upon $1/c_{\min}$ rather than the logarithm of this value.

Now we describe the Garg-Könemann modification. We continue to maintain weights $w_e^{(t)}$ for every
edge $e$, where initially, $w_e^{(1)} = 1$ for all $e$. The costs are determined by flows as before; however, we
consider a larger set of flows, viz., all non-negative flows on the source-sink paths:

$$ P := \{ f :\ f_p \ge 0 \text{ for every source-sink path } p \}. $$

Note that we no longer need to use the value $F^{\mathrm{opt}}$. Again, in each round we choose a flow $f \in P$ that is
supported on the shortest source-sink path $p^{(t)}$ with edge lengths $w_e^{(t)}/c_e$.

The main idea behind width reduction is the following: instead of routing the same flow $F^{\mathrm{opt}}$ at each
time step, we route only as much flow as is allowed by the minimum capacity edge on the path. In other
words, at time $t$ we route a flow of value $c^{(t)}$ on path $p^{(t)}$, where $c^{(t)}$ is the minimum capacity of an edge
on the path $p^{(t)}$. The cost incurred by edge $e$ is $m_e^{(t)} = c^{(t)}/c_e$. (In other words, a cost of $1/c_e$ per unit of
flow passing through $e$.) The width is therefore automatically upper bounded by 1.

We run the MW algorithm with $\eta = \varepsilon/2$. The update rule in this setting consists of updating the
weights of all edges in path $p^{(t)}$ and leaving other weights unchanged at that step:

$$ \forall e \in p^{(t)}:\quad w_e^{(t+1)} = w_e^{(t)} \left(1 + \eta \cdot \frac{c^{(t)}}{c_e}\right). $$

The termination rule for the algorithm is to stop as soon as for some edge $e$, the congestion
$f_e/c_e \ge \ln(m)/\eta^2$, where $f_e$ is the total amount of flow routed by the algorithm so far on edge $e$.

3.4.1 Analysis 


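We apply Theorem 2.5. Since we have $m_e^{(t)} \in [0, 1]$ for all edges $e$ and rounds $t$, we conclude that for any
edge $e$, we have

$$ \sum_{t=1}^T m^{(t)} \cdot p^{(t)} \ge (1 - \eta) \sum_{t=1}^T m_e^{(t)} - \frac{\ln(m)}{\eta}. \tag{3.11} $$

We now analyze both sides of this inequality. In round $t$, for any edge $e$, we have $m_e^{(t)} = c^{(t)}/c_e$ if $e \in p^{(t)}$,
and 0 if $e \notin p^{(t)}$. Thus, we have

$$ \sum_{t=1}^T m_e^{(t)} = \frac{f_e}{c_e}, \tag{3.12} $$

where $f_e$ is the total amount of flow on $e$ at the end of the algorithm, and

$$ \sum_{t=1}^T m^{(t)} \cdot p^{(t)} = \sum_{t=1}^T \frac{\sum_{e \in p^{(t)}} \frac{c^{(t)}}{c_e}\, w_e^{(t)}}{\sum_e w_e^{(t)}} = \sum_{t=1}^T c^{(t)} \cdot \frac{\sum_{e \in p^{(t)}} w_e^{(t)}/c_e}{\sum_e w_e^{(t)}}. \tag{3.13} $$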
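Now, suppose the optimum flow assigns $f_p^{\mathrm{opt}}$ flow to path $p$, and let $F^{\mathrm{opt}} = \sum_p f_p^{\mathrm{opt}}$ be the total flow.
For any set of edge lengths $w_e/c_e$, the shortest source-sink path $p$ with these edge lengths satisfies

$$
\frac{\sum_e w_e}{\sum_{e \in p} \frac{w_e}{c_e}}
\ge \frac{\sum_e w_e \cdot \frac{\sum_{p' \ni e} f_{p'}^{\mathrm{opt}}}{c_e}}{\sum_{e \in p} \frac{w_e}{c_e}}
= \frac{\sum_{p'} f_{p'}^{\mathrm{opt}} \cdot \sum_{e \in p'} \frac{w_e}{c_e}}{\sum_{e \in p} \frac{w_e}{c_e}}
\ge \sum_{p'} f_{p'}^{\mathrm{opt}} = F^{\mathrm{opt}}.
$$

The first inequality follows because for any edge $e$, we have $\sum_{p' \ni e} f_{p'}^{\mathrm{opt}} \le c_e$. The second inequality
follows from the fact that $p$ is the shortest path with edge lengths given by $w_e/c_e$. Using this bound in
(3.13), we get that

$$ \sum_{t=1}^T m^{(t)} \cdot p^{(t)} \le \sum_{t=1}^T \frac{c^{(t)}}{F^{\mathrm{opt}}} = \frac{F}{F^{\mathrm{opt}}}, \tag{3.14} $$

where $F = \sum_{t=1}^T c^{(t)}$ is the total amount of flow passed by the algorithm.
Plugging (3.12) and (3.14) into (3.11), we get that

$$ \frac{F}{F^{\mathrm{opt}}} \ge (1 - \eta) \max_e \left\{ \frac{f_e}{c_e} \right\} - \frac{\ln(m)}{\eta}. $$

We stop the algorithm as soon as there is an edge $e$ with congestion $f_e/c_e \ge \ln(m)/\eta^2$, so when
the algorithm terminates we have $\ln(m)/\eta \le \eta \max_e \{f_e/c_e\}$ and hence

$$ \frac{F}{F^{\mathrm{opt}}} \ge (1 - 2\eta) \max_e \left\{ \frac{f_e}{c_e} \right\}. $$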

Now, $C := \max_e \{ f_e/c_e \}$ is the maximum congestion of the flow passed by the algorithm. So, the flow
scaled down by $C$ respects all capacities. For this scaled-down flow, we have that the total flow is

$$ \frac{F}{C} \ge (1 - 2\eta) F^{\mathrm{opt}}, $$

which shows that the scaled-down flow is within a factor $(1 - 2\eta) = (1 - \varepsilon)$ of optimal.


Running time. In every iteration $t$ of the algorithm, consider the minimum capacity edge $e$ on the
chosen path $p^{(t)}$. Its congestion increases by 1 due to the flow of value $c^{(t)} = c_e$ sent in that round. Since we stop the
algorithm as soon as the congestion on any edge is at least $\ln(m)/\eta^2$, any given edge can be the minimum
capacity edge on the chosen path at most $\lceil \ln(m)/\eta^2 \rceil$ times in the entire run of the algorithm. Since there
are $m$ edges, the number of iterations is therefore at most $m \cdot \lceil \ln(m)/\eta^2 \rceil = O(m \log(m)/\varepsilon^2)$.

Each iteration involves $k$ shortest path computations. Recall that $T_{\mathrm{sp}}$ is the time needed for this. Thus,
the overall running time is $O(T_{\mathrm{sp}} \cdot m \log(m)/\varepsilon^2)$.
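The width-reduced algorithm is short enough to sketch in Python. The sketch below is illustrative and rests on several assumptions that are not in the text: edges are directed and given as `(u, v, capacity)` triples, each source differs from its sink, the graph has at least two edges (so that $\ln m > 0$), shortest paths are found with a textbook Dijkstra, and the function names are hypothetical.

```python
import heapq
import math

def shortest_path(n_nodes, edges, length, s, t):
    """Dijkstra on a directed graph; returns (distance, list of edge indices)."""
    adj = [[] for _ in range(n_nodes)]
    for idx, (u, v, _cap) in enumerate(edges):
        adj[u].append((v, idx))
    dist = [math.inf] * n_nodes
    prev_edge = [None] * n_nodes
    dist[s] = 0.0
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, idx in adj[u]:
            nd = d + length[idx]
            if nd < dist[v]:
                dist[v], prev_edge[v] = nd, idx
                heapq.heappush(heap, (nd, v))
    if math.isinf(dist[t]):
        return math.inf, None
    path, node = [], t
    while node != s:
        path.append(prev_edge[node])
        node = edges[prev_edge[node]][0]
    return dist[t], path[::-1]

def garg_konemann_max_flow(n_nodes, edges, pairs, eps):
    """Returns (value, per-edge flow) of the scaled-down flow."""
    m = len(edges)
    eta = eps / 2.0
    stop = math.log(m) / eta ** 2          # congestion threshold ln(m) / eta^2
    w = [1.0] * m
    flow = [0.0] * m
    total = 0.0
    while True:
        length = [w[e] / edges[e][2] for e in range(m)]
        _, path = min((shortest_path(n_nodes, edges, length, s, t) for s, t in pairs),
                      key=lambda r: r[0])
        if path is None:
            return 0.0, flow               # no source-sink path at all
        c = min(edges[e][2] for e in path) # bottleneck capacity c^(t)
        for e in path:
            w[e] *= 1.0 + eta * c / edges[e][2]
            flow[e] += c
        total += c
        congestion = max(flow[e] / edges[e][2] for e in range(m))
        if congestion >= stop:
            return total / congestion, [f / congestion for f in flow]
```

Because only the bottleneck capacity is routed in each round, every cost $c^{(t)}/c_e$ is at most 1, which is exactly the width bound used in the analysis.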


3.5 O(logn)-approximation for many NP-hard problems 


For many NP-hard problems, typically integer versions of packing-covering problems, one can compute
an $O(\log n)$-approximation by solving the obvious LP relaxation and then using Raghavan-Thompson [58]
randomized rounding. This yields a randomized algorithm; to obtain a deterministic algorithm, derandomize
it using Raghavan's [57] method of pessimistic estimators.

Young [65] has given an especially clear framework for understanding these algorithms which as a 
bonus also yields faster, combinatorial algorithms for approximating integer packing/covering programs. 
He observes that one can collapse the three ideas in the algorithm above (LP solving, randomized
rounding, derandomization) into a single algorithm that uses the multiplicative update rule, and does not
need to solve the LP relaxation directly. (Young's paper is titled "Randomized rounding without solving
the linear program.")

At the root of Young's algorithm is the observation that the analysis of randomized rounding uses the
Chernoff-Hoeffding bounds. These bounds show that the sum of bounded independent random variables
$X_1, X_2, \ldots, X_n$ (which should be thought of as the random variables generated in the randomized rounding
algorithm) is sharply concentrated about its mean, and are proved by applying Markov's inequality to the
variable $e^{\eta(\sum_i X_i)}$ for some parameter $\eta$. The key observation now is that the aforementioned application
of Markov's inequality bounds the probability of failure (of the randomized rounding algorithm) away
from 1 in terms of $\mathbb{E}[e^{\eta(\sum_i X_i)}]$. Thus, one can treat $\mathbb{E}[e^{\eta(\sum_i X_i)}]$ as a pessimistic estimator (up to scaling
by a constant) of the failure probability, and derandomization can be achieved by greedily (and hence,
deterministically) choosing the $X_i$ sequentially to decrease this pessimistic estimator. The resulting
algorithm is essentially the MW algorithm: in each round $t$, the deterministic part of the pessimistic
estimator, viz. $e^{\eta(\sum_{\tau < t} X_\tau)}$, plays the role of the weight.

Below, we illustrate this idea using the canonical problem in this class, SET COVER. (A similar 
analysis works for other problems.) Since we have developed the multiplicative weights framework 
already, we do not detail Young’s original intuition involving Chernoff bound arguments and can proceed 
directly to the algorithm. In fact, the algorithm can be simplified so it becomes exactly the classical 
greedy algorithm, and we obtain a $\ln n$-approximation, which is best-possible for this problem up to
constant factors (assuming reasonable complexity-theoretic conjectures [23]). 

In the SET COVER problem, we are given a universe of $n$ elements, say $U = \{1, 2, 3, \ldots, n\}$, and a
collection $\mathcal{C}$ of subsets of $U$ whose union equals $U$. We are required to pick the minimum number of
sets from $\mathcal{C}$ which cover all of $U$. Let this minimum number be denoted OPT. The Greedy Algorithm
picks subsets iteratively, each time choosing that set which covers the maximum number of uncovered
elements.

We analyze the Greedy Algorithm in our setup as follows. Each element of the universe represents a
constraint that the union of sets picked by the algorithm must cover it. Following the guidelines given
in the beginning of Section 3, we cast the problem in our framework by letting decisions correspond to
elements in the universe, and costs be determined by sets $C_j \in \mathcal{C}$. The cost of the constraint corresponding to
element $i$ for a given set $C_j$ is 1 if $i \in C_j$ and 0 otherwise.

To translate the greedy algorithm to our framework, suppose we run the Multiplicative Weights
Update algorithm with this setup with $\eta = 1$. Since the analysis in the proof of Theorem 2.1 technically
requires $\eta \le 1/2$, in the following we repeat the same potential function analysis for the current setting
for $\eta = 1$. For $\eta = 1$, the update rule $w_i^{(t+1)} = w_i^{(t)} (1 - \eta m_i^{(t)})$ implies that elements that have been
covered so far have weight 0 while all the rest have weight 1. Thus, the distribution $p^{(t)}$ is simply the
uniform distribution on the elements left uncovered until time $t$. Since the cost of an element for a given set is
1 if it is in the set and 0 otherwise, the set that maximizes the expected cost under $p^{(t)}$ is the one that
covers the maximum number of uncovered elements. The resulting algorithm is the Greedy Set Cover algorithm.

Since OPT sets cover all the elements, for any distribution $p_1, p_2, \ldots, p_n$ on the elements, one set
must cover at least a $1/\mathrm{OPT}$ fraction of the probability mass. This implies that if we choose the set $C$ in round $t$ that
covers the maximum number of uncovered elements, we have

$$ m^{(t)} \cdot p^{(t)} = \max_C \sum_{i \in C} p_i^{(t)} \ge 1/\mathrm{OPT}. $$

Following the analysis of Theorem 2.1, the change in potential for each round is:

$$ \Phi^{(t+1)} = \Phi^{(t)} \left(1 - \eta\, m^{(t)} \cdot p^{(t)}\right) < \Phi^{(t)} e^{-\eta/\mathrm{OPT}} = \Phi^{(t)} e^{-1/\mathrm{OPT}}. $$

The strict inequality holds because $m^{(t)} \cdot p^{(t)} > 0$ as long as there are uncovered elements. Thus, the
potential drops by a factor of $e^{-1/\mathrm{OPT}}$ every time.
We run this as long as some element has not yet been covered. We show that $T = \lceil \ln n \rceil \mathrm{OPT}$ iterations
suffice, which implies that we have a $\lceil \ln n \rceil$-approximation to OPT. We have

$$ \Phi^{(T+1)} < \Phi^{(1)} e^{-T/\mathrm{OPT}} = n e^{-\lceil \ln n \rceil \mathrm{OPT}/\mathrm{OPT}} = n e^{-\lceil \ln n \rceil} \le 1. $$

Note that with $\eta = 1$, $\Phi^{(T+1)}$ is exactly the number of elements left uncovered after $T$ iterations. So we
conclude that all elements are covered.
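The correspondence between the greedy algorithm and MW with $\eta = 1$ can be seen in a short Python sketch. The names and the 0-indexed universe are assumptions made for illustration; the weights are kept explicitly as 0/1 values so that the update rule $w_i \leftarrow w_i(1 - \eta m_i)$ is visible.

```python
def greedy_set_cover(n, sets):
    """Greedy SET COVER written as MW with eta = 1; elements are 0..n-1.

    Returns the indices of the chosen sets, in the order they were picked.
    """
    w = [1.0] * n                      # weight 1 = uncovered, weight 0 = covered
    chosen = []
    while sum(w) > 0:                  # the potential counts uncovered elements
        # Maximizing the expected cost under p^(t) is the same as maximizing
        # the total weight (number of uncovered elements) inside the set.
        best = max(range(len(sets)), key=lambda j: sum(w[i] for i in sets[j]))
        if sum(w[i] for i in sets[best]) == 0:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        for i in sets[best]:
            w[i] *= 1.0 - 1.0          # w_i <- w_i (1 - eta * m_i) with eta = m_i = 1
    return chosen
```

For example, with six elements and the hypothetical family `[{0,1,2}, {3,4,5}, {0,3}, {1,4}, {2,5}]`, the optimum is 2 and the bound of the analysis allows at most $\lceil \ln 6 \rceil \cdot 2 = 4$ sets.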


3.6 Learning theory and boosting 


Boosting [60]—the process of combining several moderately accurate rules-of-thumb into a single 
highly accurate prediction rule—is a central idea in Machine Learning today. Freund and Schapire’s 
AdaBoost [26] uses the Multiplicative Weights Update Rule and fits in our framework. Here we explain 
the main idea using some simplifying assumptions. 

Let $X$ be some set (domain) and suppose we are trying to learn an unknown function (concept)
$c : X \to \{0, 1\}$ chosen from a concept class $\mathcal{C}$. Given a sequence of training examples $(x, c(x))$ where $x$ is
generated from a fixed but unknown distribution $D$ on the domain $X$, the learning algorithm is required to
output a hypothesis $h : X \to \{0, 1\}$. The error of the hypothesis is defined to be $\mathbb{E}_{x \sim D}[|h(x) - c(x)|]$.

A strong learning algorithm is one that, for every distribution $D$, given $\varepsilon, \delta > 0$ and access to random
examples drawn from $D$, outputs with probability at least $1 - \delta$ a hypothesis whose error is at most $\varepsilon$.
Furthermore, the running time is required to be polynomial in $1/\varepsilon$, $1/\delta$ and other relevant parameters. A
$\gamma$-weak learning algorithm, for some given $\gamma > 0$, is an algorithm satisfying the same conditions but the
error can be as high as $1/2 - \gamma$. Boosting shows that if a $\gamma$-weak learning algorithm exists for a concept
class, then a strong learning algorithm exists. (The running time of the algorithm and the number of
samples may depend on $\gamma$.)

We prove this result in the so-called boosting by sampling framework, which uses a fixed training 
set S of N examples drawn from the distribution D. The goal is to make sure that the final hypothesis 
erroneously classifies at most € fraction of this training set. Using VC-dimension theory (see [26]) one 
can then show that if the weak learner produces hypotheses from a class H of bounded VC-dimension, 
and if N is chosen large enough (in terms of the error and confidence parameters, and the VC-dimension 
of the hypothesis class H), then with probability at least $1 - \delta$ over the choice of the sample set, the error
of the hypothesis over the entire domain $X$ (under distribution $D$) is at most $2\varepsilon$.

The idea in boosting is to repeatedly run the weak learning algorithm on different distributions defined 
on the fixed training set S. The final hypothesis has error € under the uniform distribution on S. We run 
the MW algorithm with n = y for T = | (2/7*)In(1/e)| rounds. The decisions correspond to samples in 
S and costs are specified by a hypothesis generated by the weak learning algorithm, in the following way. 
If hypothesis A is generated, the cost for decision point x is 1 or 0 depending on whether h(x) = c(x) or 
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not. In other words, the cost vector m (indexed by x rather than i) is specified by mx = 1 — |h(x) —c(x)]. 
Intuitively, we want the weight of an example to increase if the hypothesis labels it incorrectly. 

In each iteration, the algorithm presents the current distribution p” on the examples to the weak 
learning algorithm, and in return obtains a hypothesis A?) whose error with respect to the distribution p” 
is not more than 1/2 — y, in other words, the expected cost in each iteration satisfies 


The algorithm is run for T rounds, where T will be specified shortly. The final hypothesis, Afinal, labels 
x € X according to the majority vote among A® (x), hP (x), ..., h (x). 
Let E be the set of x € S incorrectly labeled by gna. The total cost of each x € E, 


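\[ \sum_t m_x^{(t)} = \sum_t \left( 1 - |h^{(t)}(x) - c(x)| \right) \leq \frac{T}{2}, \]

since the majority vote gives an incorrect label for it. Instead of applying Theorem 2.1, we apply 
Theorem 2.4 which gives a more nuanced bound. The set $\mathcal{P}$ in this theorem is simply the set of all possible 
distributions on S. Choosing $p$ to be the uniform distribution on E, we get 

\[ \left( \frac{1}{2} + \gamma \right) T \leq \sum_t m^{(t)} \cdot p^{(t)} \leq (1+\eta) \sum_t m^{(t)} \cdot p + \frac{\mathrm{RE}(p \,\|\, p^{(1)})}{\eta} \leq (1+\eta) \frac{T}{2} + \frac{\ln(n/|E|)}{\eta}. \]

Since $\eta = \gamma$ and $T = \lceil (2/\gamma^2) \ln(1/\varepsilon) \rceil$, the above inequality implies that the fraction of errors, $|E|/n$, is 
at most $\varepsilon$ as desired. 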
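
The boosting loop above can be illustrated with a small Python sketch. Everything concrete here is an assumption made for the example: the toy training set, the pool of three hypotheses (each wrong on a different third of the points, so that some hypothesis always has accuracy at least 2/3 under any distribution), and the names `boost` and `make_hypothesis`. The weak learner is simulated by picking the best hypothesis in the pool.

```python
import math

def boost(examples, labels, hypotheses, gamma, eps):
    """MW-based boosting on a fixed training set (cost 1 when the hypothesis is right)."""
    n = len(examples)
    eta = gamma
    T = math.ceil((2.0 / gamma ** 2) * math.log(1.0 / eps))
    weights = [1.0] * n
    chosen = []
    for _ in range(T):
        total = sum(weights)
        p = [w / total for w in weights]

        def accuracy(h):
            return sum(p[i] for i in range(n) if h(examples[i]) == labels[i])

        # Toy weak learner: best hypothesis in the pool under the distribution p.
        h = max(hypotheses, key=accuracy)
        assert accuracy(h) >= 0.5 + gamma - 1e-9, "weak-learning assumption violated"
        chosen.append(h)
        # Correctly labelled examples pay cost 1 and lose weight, so the
        # misclassified ones gain relative weight.
        for i in range(n):
            m = 1.0 if h(examples[i]) == labels[i] else 0.0
            weights[i] *= (1.0 - eta * m)

    def h_final(x):
        votes = sum(h(x) for h in chosen)
        return 1 if 2 * votes > len(chosen) else 0

    return h_final, T

# Hypothetical toy data: hypothesis k errs exactly on the points with x % 3 == k.
examples = list(range(12))
labels = [x % 2 for x in examples]

def make_hypothesis(k):
    return lambda x: (1 - x % 2) if x % 3 == k else x % 2

hypotheses = [make_hypothesis(k) for k in range(3)]
h_final, T = boost(examples, labels, hypotheses, gamma=1 / 6, eps=0.1)
train_error = sum(h_final(x) != c for x, c in zip(examples, labels)) / len(examples)
print(T, train_error)
```

In this toy instance every point is mislabeled by only one of the three hypotheses, so the majority vote should label the whole training set correctly once the weights have balanced the choices.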


3.7 Hard-core sets and the XOR Lemma 


A boolean function $f : X \to \{0,1\}$, where X is a finite domain, is $\gamma$-strongly hard for circuits of size S if 
for every circuit C of size at most S, 

\[ \Pr_x[C(x) = f(x)] \leq \frac{1}{2} + \gamma. \]

Here $x \in X$ is drawn uniformly at random, and $\gamma < 1/2$ is a parameter. For some parameter $\varepsilon > 0$, it is 
$\varepsilon$-weakly hard for circuits of size S if for every circuit C of size at most S, we have 

\[ \Pr_x[C(x) = f(x)] \leq 1 - \varepsilon. \]

Now given $f : \{0,1\}^n \to \{0,1\}$, define $f^{\oplus k} : \{0,1\}^{nk} \to \{0,1\}$ to be the boolean function obtained 
by dividing up the input nk-bit string into k blocks of n bits each in the natural way, applying f to each 
block in turn, and taking the XOR of the k outputs. Yao’s XOR Lemma [64] shows that if f is $\varepsilon$-weakly 
hard against circuits of size S then $f^{\oplus k}$ is $(\gamma + (1 - \varepsilon)^k)$-strongly hard for circuits of size $S\varepsilon^2\gamma^2/8$. 
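
To make the definition of the XOR-ed function concrete, here is a minimal Python sketch. The helper name `xor_power` and the choice of AND as the base function are hypothetical and serve only as an illustration.

```python
from functools import reduce

def xor_power(f, n, k):
    """Return the function that splits n*k bits into k blocks, applies f, and XORs the results."""
    def g(bits):
        assert len(bits) == n * k
        blocks = [bits[j * n:(j + 1) * n] for j in range(k)]
        return reduce(lambda a, b: a ^ b, (f(block) for block in blocks), 0)
    return g

# Hypothetical base function: AND of two bits (n = 2), combined over k = 3 blocks.
f_and = lambda block: block[0] & block[1]
g = xor_power(f_and, n=2, k=3)
print(g((1, 1, 0, 1, 1, 1)))  # blocks (1,1), (0,1), (1,1) give 1 XOR 0 XOR 1
```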

The original proofs were difficult but Impagliazzo [40] suggested a simpler proof that as a byproduct 
proves an interesting fact about weakly hard functions: there is a reasonably large subset (at least an $\varepsilon$ 
fraction of X) of inputs on which the function behaves like a strongly hard function, for somewhat smaller 
circuit size $S'$. This subset is called a hard-core set. For technical reasons, to prove such a result, it 
suffices to exhibit a “smooth” distribution $p$ on X (precise definition given momentarily), such that for any 
circuit C of size at most $S'$, we have 

\[ \Pr_{x \sim p}[C(x) = f(x)] \leq \frac{1}{2} + \gamma. \]

An $\varepsilon$-smooth distribution $p$ (the same $\varepsilon$ from the weak hardness assumption on f) is one that doesn’t 
assign too much probability to any single input: $p_x \leq 1/(\varepsilon|X|)$ for any $x \in X$. Such a distribution can be 
decomposed as a convex combination of probability distributions over subsets of size at least $\varepsilon|X|$. 

Klivans and Servedio [49] observed that Impagliazzo’s proof is an application of a boosting algorithm. 
The argument is as follows. We are given a boolean function f that is $\varepsilon$-weakly hard for circuits of size S. 
Assume for the sake of contradiction that f is not $\gamma$-strongly hard on any smooth distribution for circuits of 
some size $S' < S$. Then for any smooth distribution, we can find a small circuit of size $S'$ that calculates f 
correctly with probability better than $1/2 + \gamma$ when inputs are drawn from the distribution. Treat this as a 
weak learning algorithm, and apply boosting. Boosting combines the small circuits of size $S'$ found by 
the weak learning algorithm into a larger circuit that calculates f correctly with probability at least $1 - \varepsilon$ 
on the uniform distribution on X, contradicting the fact that f is $\varepsilon$-weakly hard, if we ensure that the size of 
the larger circuit is smaller than S. This can be done if $S'$ is set to $O(S/T)$, where T is the number of 
boosting rounds. 

With this insight, the problem now boils down to designing boosting algorithms that (a) are able to 
deal with smooth distributions on inputs and (b) have a small number of boosting rounds. The lower the 
number of boosting rounds, the better the circuit size bound we get for showing $\gamma$-strong hardness. 

The third author [45] has shown how to construct such a boosting algorithm using the MW algorithm 
for restricted distributions (see Section 2.2). This boosting algorithm obtains the best known parameters 
in hard-core set constructions directly without having to resort to composing two different boosting 
algorithms as in [49]. This technique was extended in [8] to obtain uniform constructions of hard-core 
sets with the best known parameters. 

We describe the boosting algorithm of [45] now. The main observation is that the set of all $\varepsilon$-smooth 
distributions is convex. Call this set $\mathcal{P}$. Then, exactly as in Section 3.6, the boosting algorithm simply 
runs the MW algorithm, with the only difference being that the distributions it generates are restricted 
to be in $\mathcal{P}$ using relative entropy projections, as in the algorithm of Section 2.2. We can now apply the 
same analysis as in Section 3.6. Following this analysis, let $E \subseteq X$ be the set of inputs on which the final 
majority hypothesis incorrectly computes f. Now we claim that $|E| < \varepsilon|X|$: otherwise, since the uniform 
distribution on E is $\varepsilon$-smooth, we obtain a contradiction for $T = \lceil (2/\gamma^2) \ln(1/\varepsilon) \rceil$ as before. Thus, the 
final majority hypothesis computes f correctly on at least a $1 - \varepsilon$ fraction of inputs from X. 

This immediately implies the following hard-core set existence result, which has the best known 
parameters to date: 


Theorem 3.8. Given a function $f : X \to \{0,1\}$ that is $\varepsilon$-weakly hard for circuits of size S, there is a 
subset of X of size at least $\varepsilon|X|$ on which f is $\gamma$-strongly hard for circuits of size 

\[ S' = \Omega\!\left( \frac{\gamma^2}{\log(1/\varepsilon)} \, S \right). \]


3.8 Hannan’s algorithm and multiplicative weights 


Perhaps the earliest decision-making algorithm which attains bounds similar to the MW algorithm 
is Hannan’s algorithm [34], dubbed “follow the perturbed leader” by Kalai and Vempala [43]. This 
algorithm is given below. 


Initialization: Pick random numbers $r_1, r_2, \ldots, r_n$, one for each decision. 
For $t = 1, 2, \ldots, T$: 


1. Choose the decision which minimizes the total cost including random initial cost: 

\[ i^{(t)} = \arg\min_i \left\{ L_i^{(t)} + r_i \right\} \qquad (3.15) \]

where $L_i^{(t)} = \sum_{\tau < t} m_i^{(\tau)}$ is the total cost so far for the ith decision. 


2. Observe the costs of the decisions $m^{(t)}$. 


Figure 3: Hannan’s algorithm. 

In this section we show that for a particular choice of random initialization numbers $\{r_i\}$, the algorithm 
above exactly reproduces the multiplicative weights algorithm, or more precisely the Hedge algorithm as 
in Section 2.1. This observation is due to Adam Kalai [42]. 


Theorem 3.9 ([42]). Let $u_1, \ldots, u_n$ be n independent random numbers chosen uniformly from [0,1], and 
consider the algorithm above with $r_i = \frac{1}{\eta} \ln\ln\frac{1}{u_i}$. Then for any decision j, we have 

\[ \Pr[i^{(t)} = j] = \frac{e^{-\eta L_j^{(t)}}}{\sum_i e^{-\eta L_i^{(t)}}}. \]


Proof. By monotonicity of the exponential function, we have: 

\[ \arg\min_i \left\{ L_i^{(t)} + \frac{1}{\eta} \ln\ln\frac{1}{u_i} \right\} = \arg\min_i \left\{ \eta L_i^{(t)} + \ln\ln\frac{1}{u_i} \right\} = \arg\max_i \frac{e^{-\eta L_i^{(t)}}}{\ln\frac{1}{u_i}} = \arg\max_i \frac{w_i^{(t)}}{s_i}, \]

where $w_i^{(t)} = e^{-\eta L_i^{(t)}}$ and the $s_i = \ln\frac{1}{u_i}$ are independent exponentially distributed random variables with 
mean 1. The result now follows from Lemma 3.10 below. 

Lemma 3.10. Let $w_1, \ldots, w_n$ be arbitrary non-negative real numbers, and let $s_1, \ldots, s_n$ be independent 
exponential random variables with mean 1. Then 

\[ \Pr\left[ \arg\max_i \frac{w_i}{s_i} = j \right] = \frac{w_j}{\sum_i w_i}. \]


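Proof. The probability density function of $s_i$ is $e^{-s_i}$. Conditioning on the value of $s_j$, for a particular $i \neq j$ 
we have 

\[ \Pr\left[ \frac{w_i}{s_i} \leq \frac{w_j}{s_j} \,\middle|\, s_j \right] = \int_{s_i = w_i s_j / w_j}^{\infty} e^{-s_i} \, ds_i = e^{-w_i s_j / w_j}. \]

Integrating to remove the conditioning on $s_j$, and using the independence of the $s_i$, we have 

\[ \Pr\left[ \forall i \neq j: \frac{w_i}{s_i} \leq \frac{w_j}{s_j} \right] = \int_{s_j = 0}^{\infty} \prod_{i \neq j} e^{-w_i s_j / w_j} \, e^{-s_j} \, ds_j = \int_{s_j = 0}^{\infty} e^{-s_j \sum_i w_i / w_j} \, ds_j = \frac{w_j}{\sum_i w_i}. \]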
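
The equivalence of Theorem 3.9 can be checked empirically. The following Python sketch is only an illustration under assumed values: the cumulative costs `L`, the value of `eta`, the seed, and the function names are all hypothetical. Each trial draws fresh random numbers $r_i$, as Hannan's algorithm would at initialization, and records which decision minimizes $L_i + r_i$; the observed frequencies are then compared with the Hedge probabilities.

```python
import math
import random

def hannan_choice(L, eta, rng):
    """One draw of Hannan's rule with r_i = (1/eta) * ln ln (1/u_i)."""
    best, best_val = None, float("inf")
    for i, Li in enumerate(L):
        u = rng.random()
        while u == 0.0:            # avoid ln(1/0); random() never returns 1.0
            u = rng.random()
        r = (1.0 / eta) * math.log(-math.log(u))   # -ln(u) equals ln(1/u)
        if Li + r < best_val:
            best, best_val = i, Li + r
    return best

def hedge_distribution(L, eta):
    """Probabilities proportional to exp(-eta * L_i)."""
    w = [math.exp(-eta * Li) for Li in L]
    total = sum(w)
    return [wi / total for wi in w]

rng = random.Random(0)
L = [3.0, 1.0, 2.0]     # illustrative cumulative costs
eta = 0.5
trials = 200_000
counts = [0] * len(L)
for _ in range(trials):
    counts[hannan_choice(L, eta, rng)] += 1
empirical = [c / trials for c in counts]
target = hedge_distribution(L, eta)
print(empirical, target)
```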


3.9 Online convex optimization 


Online convex optimization is a very general framework that can be applied to many of the problems 
discussed in the applications section and many more. Here “online” means that the algorithm does not 
know the entire input at the start, and the input is presented to it in pieces over many rounds. In this 
section we describe the framework and the central role of the multiplicative weights method. For a much 
more detailed treatment of online learning techniques see [16]. 

In online convex optimization, we move from a discrete decision set to a continuous one. Specifically, 
the set of decisions is a convex, compact set $K \subseteq \mathbb{R}^n$. In each round $t = 1, 2, \ldots$, the online algorithm 
is required to choose a decision, i.e., a point $p^{(t)} \in K$. A convex loss function $f^{(t)}$ is presented, and 
the decision maker incurs a loss of $f^{(t)}(p^{(t)})$. The goal of the online algorithm A is to minimize its loss 
compared to that of the best fixed offline strategy. This difference is called regret in the game theory and machine 
learning literature. 


The basic decision-making problem described in Section 2 with n discrete decisions is recovered as a 
special case of online convex optimization as follows. The convex set K is the n-dimensional simplex 
corresponding to the set of all distributions over the n decisions, and the loss functions $f^{(t)}$ are defined 
as $f^{(t)}(p^{(t)}) = m^{(t)} \cdot p^{(t)}$ given the cost vector $m^{(t)}$. It also generalizes other online learning problems such 
as the online portfolio selection problem and online routing (see [35] for more discussion on applications). 
Zinkevich [66] gives algorithms for the general online convex optimization problem. More efficient 
algorithms for online convex optimization based on strong convexity of the loss functions have also been 
developed [37]. 

We now describe how to apply the Multiplicative Weights algorithm to the online convex optimization 
problem for the special case where K is the n-dimensional simplex of distributions on the coordinates. 
The advantage is that this algorithm has much better dependence on the dimension n than Zinkevich’s 
algorithm (see [35] for more details). This algorithm has several applications such as in online portfolio 
selection [38]. Here is the algorithm. First, define 

\[ \rho := \max_{p \in K} \max_t \| \nabla f^{(t)}(p) \|_\infty, \]

where $\nabla f^{(t)}(p)$ is the (sub)gradient of the function $f^{(t)}$ at point $p$. The parameter $\rho$ is called the width of 
the problem. Then, run the standard Multiplicative Weights algorithm with $\eta = \sqrt{\ln(n)/T}$ and the costs 
defined as 

\[ m^{(t)} := \frac{1}{\rho} \nabla f^{(t)}(p^{(t)}), \]

where $p^{(t)}$ is the point played in round $t$. Note that for all $t$ and all $i$, $|m_i^{(t)}| \leq 1$ as required by the MW 
algorithm. 


Theorem 3.11. After T rounds of applying the Multiplicative Weights algorithm to the online convex 
optimization framework, for any $p \in K$ we have 

\[ \sum_{t=1}^{T} f^{(t)}(p^{(t)}) - \sum_{t=1}^{T} f^{(t)}(p) \leq 2\rho \sqrt{\ln(n)\, T}. \]

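Proof. If $f : K \to \mathbb{R}$ is a differentiable convex function, then for any two points $p, q \in K$ we have 

\[ f(q) \geq f(p) + \nabla f(p) \cdot (q - p), \]

where $\nabla f(p)$ is the gradient of f at p. Rearranging we get 

\[ f(p) - f(q) \leq \nabla f(p) \cdot (p - q). \qquad (3.16) \]

Applying Corollary 2.2, we get that for any $p \in K$, 

\[ \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \leq \sum_{t=1}^{T} \left( m^{(t)} + \eta |m^{(t)}| \right) \cdot p + \frac{\ln n}{\eta} \leq \sum_{t=1}^{T} m^{(t)} \cdot p + \eta T + \frac{\ln n}{\eta}, \qquad (3.17) \]

since $\|m^{(t)}\|_\infty \leq 1$. Now we have 

\[ \begin{aligned} \sum_{t=1}^{T} \left[ f^{(t)}(p^{(t)}) - f^{(t)}(p) \right] &\leq \sum_{t=1}^{T} \nabla f^{(t)}(p^{(t)}) \cdot (p^{(t)} - p) && \text{(from (3.16))} \\ &= \sum_{t=1}^{T} \rho\, m^{(t)} \cdot (p^{(t)} - p) \\ &\leq \rho \left( \eta T + \frac{\ln n}{\eta} \right) && \text{(from (3.17))} \end{aligned} \]

Substituting $\eta = \sqrt{\ln n / T}$ completes the proof. 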
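
As an illustration of the reduction, the Python sketch below runs the standard MW update on the simplex with the rescaled gradients as costs and compares the resulting regret with the bound of Theorem 3.11. The quadratic losses $f^{(t)}(p) = \|p - z_t\|^2$, the random targets $z_t$, the seed, and the function names are assumptions chosen for the example; for these losses the gradient is $2(p - z_t)$, so $\rho = 2$ is a valid width.

```python
import math
import random

def mw_online_convex(grad_fns, n, rho):
    """MW on the simplex with costs m = grad f_t(p_t) / rho."""
    T = len(grad_fns)
    eta = math.sqrt(math.log(n) / T)
    w = [1.0] * n
    played = []
    for grad in grad_fns:
        total = sum(w)
        p = [wi / total for wi in w]
        played.append(p)
        m = [g / rho for g in grad(p)]      # each |m_i| <= 1 by the choice of rho
        w = [wi * (1.0 - eta * mi) for wi, mi in zip(w, m)]
    return played

def loss(z, p):
    return sum((pi - zi) ** 2 for pi, zi in zip(p, z))

rng = random.Random(1)
n, T, rho = 5, 2000, 2.0
targets = []
for _ in range(T):
    raw = [rng.random() for _ in range(n)]
    s = sum(raw)
    targets.append([r / s for r in raw])    # a point z_t in the simplex

grad_fns = [(lambda p, z=z: [2.0 * (pi - zi) for pi, zi in zip(p, z)]) for z in targets]
played = mw_online_convex(grad_fns, n, rho)

# For squared distances, the best fixed point in hindsight is the mean of the z_t.
mean = [sum(z[i] for z in targets) / T for i in range(n)]
regret = sum(loss(z, p) for z, p in zip(targets, played)) - sum(loss(z, mean) for z in targets)
bound = 2.0 * rho * math.sqrt(math.log(n) * T)
print(regret, bound)
```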
3.10 Other applications 
3.10.1 Approximately solving certain semidefinite programs 


Semidefinite programming (SDP) is a special case of convex programming. A semidefinite program is 
derived from a linear program by imposing the additional constraint that some subset of $n^2$ variables form 
an $n \times n$ positive semidefinite matrix. Since the work of Goemans and Williamson [31], SDP has become 
an important tool in design of approximation algorithms for NP-hard optimization problems. Though this 
yields polynomial-time algorithms, their practicality is suspect because solving SDPs is a slow process. 
Therefore there is great interest in computing approximate solutions to SDPs, especially the types that 
arise in approximation algorithms. Since the SDPs in question are being used to design approximation 
algorithms anyway, it is permissible to compute approximate solutions to these SDPs. 

Klein and Lu [47] use the PST framework (Section 3.3) to derive a more efficient 0.878-approximation 
algorithm for MAX-CUT than the original SDP-based method in Goemans-Williamson [31]. The main 
idea in Klein-Lu is to approximately solve the MAX-CUT SDP. However, their idea does not work very well 
for other SDPs. The main issue is that the width $\rho$ (see Section 3.3) is too high for certain SDPs of interest. 

To be more precise, an SDP feasibility problem is given by: 


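\[ \begin{aligned} \forall j = 1, 2, \ldots, m: \quad A_j \bullet X &\geq b_j, \\ X &\in \mathcal{P}. \end{aligned} \]

Here, we use the notation $A \bullet B = \sum_{ij} A_{ij} B_{ij}$ to denote the scalar product of two matrices thinking of 
them as vectors in $\mathbb{R}^{n^2}$. The set $\mathcal{P} = \{X \in \mathbb{R}^{n \times n} \mid X \succeq 0,\ \mathrm{Tr}(X) \leq 1\}$ is the set of all positive semidefinite 
matrices with trace bounded by 1. 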

The Plotkin-Shmoys-Tardos framework (see Section 3.3) is suitable for approximating SDPs since 
all constraints are linear, and the oracle given in (3.7) can be implemented efficiently by an eigenvector 
computation. To see this, note that the oracle needs to decide, given a probability distribution p on the 
constraints, if there exists an $X \in \mathcal{P}$ such that $\sum_j p_j A_j \bullet X \geq \sum_j p_j b_j$. This can be implemented by solving 
the following optimization problem: 

\[ \max_{X \in \mathcal{P}} \ \sum_j p_j A_j \bullet X. \]

It is easily checked that an optimal solution to the above optimization problem is given by the matrix 
$X = vv^\top$ where $v$ is a unit eigenvector corresponding to the largest eigenvalue of the matrix $\sum_j p_j A_j$. 
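
A minimal NumPy sketch of this oracle follows. The function name `sdp_oracle` and the small $2 \times 2$ instance are hypothetical. One detail in the code goes slightly beyond the statement above and should be read as our own addition: when the largest eigenvalue is negative, the zero matrix (which also has trace at most 1) attains a larger objective than any rank-one matrix, so the sketch returns it in that case.

```python
import numpy as np

def sdp_oracle(p, A_list, b):
    """Maximize sum_j p_j A_j . X over PSD X with Tr(X) <= 1 via a top eigenvector."""
    C = sum(pj * Aj for pj, Aj in zip(p, A_list))
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    lam, v = eigvals[-1], eigvecs[:, -1]
    # Rank-one solution v v^T, or the zero matrix if every eigenvalue is negative.
    X = np.outer(v, v) if lam > 0 else np.zeros_like(C)
    value = float(np.sum(C * X))               # C . X = sum_ij C_ij X_ij
    feasible = value >= float(np.dot(p, b))
    return X, value, feasible

# Hypothetical instance with two constraints.
A_list = [np.array([[2.0, 1.0], [1.0, 2.0]]), np.array([[0.0, 1.0], [1.0, 0.0]])]
p = np.array([0.5, 0.5])
b = np.array([1.0, 0.5])
X, value, feasible = sdp_oracle(p, A_list, b)
print(value, feasible)
```

Here the weighted matrix is the all-ones $2 \times 2$ matrix, whose largest eigenvalue is 2, so the oracle value is 2 and the weighted constraint is satisfiable.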

The Klein-Lu approach was of limited use in many cases because it does not do too well when the 
additive error $\varepsilon$ is required to be small. (They were interested in the MAX-CUT problem, where this 
problem does not arise. The reason in a nutshell is that in a graph with m edges, the maximum cut has at 
least m/2 edges, so it suffices to compute the optimum to an additive error $\varepsilon$ that is a fixed constant.) We 
have managed to extend the multiplicative weights framework to many of these settings to design efficient 
algorithms for SDP relaxations of many other problems. The main idea is to apply the Multiplicative 
Weights framework in a “nested” fashion: one can solve a constrained optimization problem by invoking 
the MW algorithm on a subset of constraints (the “outer” constraints) in the manner of Section 3.3, 
where the domain is now defined by the rest of the constraints (the “inner” constraints). The oracle can 
now be implemented by another application of the MW algorithm on the inner constraints. Alternatively, 
we can reduce the dependence on the width by using the observation that the Lagrangian relaxation 
problem on the inner constraints can be solved by the ellipsoid method. For details of this method, refer 
to our paper [3]. For several families of SDPs we obtain the best running time known. 

More recently Arora and Kale [5] have designed a new approach for solving SDPs that involves a 
variant of the multiplicative update rule at the matrix level; see Section 5 for details. 

Yet another approach for approximately solving SDPs is by reducing the SDP into a maximization 
problem of a single concave function over the PSD cone. The latter problem can be approximated 
efficiently via an iterative greedy method. The resulting algorithm is extremely similar to the MW-based 
algorithm of [3]; however, its analysis is very different (see [36] for more details). 


3.10.2 Approximating graph separators 


A recent application of the multiplicative weights method is a combinatorial algorithm for approximating 
the SPARSEST CUT of graphs [4]. This is a fundamental graph partitioning problem. Given a graph 
$G = (V, E)$, the expansion of a cut $(S, \bar{S})$, where $S \subseteq V$ and $\bar{S} = V \setminus S$, is defined to be 

\[ \frac{|E(S, \bar{S})|}{\min\{|S|, |\bar{S}|\}}. \]

Here, $E(S, \bar{S})$ is the set of edges with one end point in $S$ and the other in $\bar{S}$. The SPARSEST CUT problem 
is to find the cut in the input graph of minimum expansion. This problem arises as a useful subroutine in 
many other algorithms, such as in divide-and-conquer algorithms for optimization problems on graphs, 
layout problems, clustering, etc. Furthermore, the expansion of a graph is a very useful way to quantify 
the connectivity of a graph and has many important applications in computer science. 
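
The definition of expansion can be made concrete with a brute-force Python sketch. It enumerates all cuts and therefore takes exponential time; it is meant only to illustrate the quantity being approximated, not the algorithms discussed in this section. The example graph (two triangles joined by one edge) and the function names are hypothetical.

```python
from itertools import combinations

def expansion(edges, S, vertices):
    """Number of crossing edges divided by the size of the smaller side."""
    S = set(S)
    S_bar = set(vertices) - S
    crossing = sum(1 for u, v in edges if (u in S) != (v in S))
    return crossing / min(len(S), len(S_bar))

def sparsest_cut_brute_force(vertices, edges):
    """Exhaustive search over cuts; suitable for tiny graphs only."""
    best = None
    # By symmetry it suffices to consider sides with at most half the vertices.
    for size in range(1, len(vertices) // 2 + 1):
        for S in combinations(vertices, size):
            val = expansion(edges, S, vertices)
            if best is None or val < best[0]:
                best = (val, set(S))
    return best

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3).
vertices = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
value, S = sparsest_cut_brute_force(vertices, edges)
print(value, S)
```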

The work of Arora, Rao and Vazirani [7] gave the first $O(\sqrt{\log n})$ approximation algorithm for the 
SPARSEST CUT problem. However, their best algorithm relies on solving an SDP and runs in $\tilde{O}(n^{4.5})$ time. 
They also gave an alternative algorithm based on the notion of expander flows, which are multicommodity 
flows in the graph whose demand graph has high expansion. However, their algorithm was based on the 
ellipsoid method, and was thus quite inefficient. In the paper [4], we obtained a much more efficient 
algorithm for approximating the SPARSEST CUT problem to an $O(\sqrt{\log n})$ factor in $\tilde{O}(n^2)$ time using 
the expander flow idea. The algorithm casts the problem of routing an expander flow in the graph as a 
linear program, and then checks the feasibility of the linear program using the techniques described in 
Section 3.3. The oracle for this purpose is implemented using a variety of techniques: the multicommodity 
flow algorithm of Garg and Könemann [29, 30] (and its subsequent improvement by Fleischer [24]), 
eigenvalue computations, and graph sparsification algorithms of Benczur and Karger [9] based on random 
sampling. 


3.10.3 Multiplicative weight algorithms in geometry 


The multiplicative weight idea has been used several times in computational geometry. Chazelle [19] 
(p. 6) describes the main idea, which is essentially the connection between derandomization of Chernoff 
bound arguments and the exponential potential function noted in Section 3.5. 

The geometric applications consist of derandomizing a natural randomized algorithm by using a 
deterministic construction of some kind of small set cover or low-discrepancy set. Formally, the analysis 
is similar to our analysis of the Set Cover algorithm in Section 3.5. Clarkson used this idea to give a 
deterministic algorithm for Linear Programming [20, 21]. Following Clarkson, Brönnimann and Goodrich 
use similar methods to find Set Covers for hypergraphs with small VC dimension [11]. 

The MW algorithm was also used in the context of geometric embeddings of finite metric spaces, 
specifically, embedding negative-type metrics (i.e., a set of points in Euclidean space such that the 
squared Euclidean distance between them also forms a metric) into $\ell_1$. Such embeddings are important 
for approximating the non-uniform SPARSEST CUT problem. 

The approximation algorithm for SPARSEST CUT in Arora et al. [7] involves a “Structure Theorem.” 
This structure theorem was interpreted by Chawla et al. [18] as saying that any n-point negative-type 
metric with maximum distance $O(1)$ and average distance $\Omega(1)$ can be embedded into $\ell_1$ such that the 
average $\ell_1$ distance is $\Omega(1/\sqrt{\log n})$. Then they used the MW algorithm to construct an embedding into 
$\ell_1$ in which every pair of points that have negative-type metric distance $\Omega(1)$ have $\ell_1$ distance that is off 
by at most an $O(\sqrt{\log n})$ factor of the original. Using a similar idea for other distances and combining 
the resulting embeddings they obtained an embedding of the negative-type metric into $\ell_1$ in which every 
distance distorts by at most a factor $O(\log^{3/4} n)$. Arora et al. [6] gave a more complicated construction to 
improve the distortion bound to $O(\sqrt{\log n} \log\log n)$, leading to an $O(\sqrt{\log n} \log\log n)$-approximation 
for non-uniform sparsest cut. 


3.11 Design of competitive online algorithms 


Starting with the work of Alon, Awerbuch, Azar, Buchbinder and Naor [1], a number of competitive online 
algorithms have been developed using an elegant primal-dual approach which involves multiplicative 
weight updates. While the analysis of their algorithms seems to be beyond our general framework, 
we briefly mention this work without going into many details. We refer the readers to the survey by 
Buchbinder and Naor [14] for an extensive discussion of the topic. 

Several online problems such as the ski rental problem, caching, load balancing, ad auctions, etc. can 
be cast (in their fractional form) as a linear program with non-negative coefficients in the constraints and 
cost, where either the constraints or variables arrive online one by one. The online problem is to maintain 
a feasible solution at all times with a bounded competitive ratio, i.e., ensuring that the cost of the solution 
maintained is bounded in terms of the cost of the optimal solution in each round. The main difficulty 
comes from the requirement that the solution maintained is monotonic in some sense (for example, the 
variables are never allowed to decrease). 

Buchbinder and Naor [15] give an algorithm based on the primal-dual method that obtains good 
competitive ratios in this scenario. At the heart is a multiplicative weight update rule. Imagine we have a 
covering LP, i.e., all constraints are of the form $a \cdot x \geq 1$, where $a$ has non-negative coefficients, and $x$ is 
the vector of variables. The cost function has non-negative coefficients as well. Constraints arrive one at 
a time in each round and must be satisfied by the current solution. The requirement is that no variable can 
decrease from round to round. 

In every round, the algorithm increases primal variables in the new constraint using multiplicative 
updates and the corresponding dual variable additively (so essentially, each primal variable is the 
exponential of the dual constraint that it corresponds to, much like the Plotkin-Shmoys-Tardos algorithms 
of Section 3.3). This is done until the constraint gets satisfied. Clearly, we maintain a feasible solution 
in each round, and the variables never decrease. The analysis goes by bounding the increase in the 
primal cost in terms of the dual cost (this is an easy consequence of the multiplicative update, in fact, 
the multiplicative update can be derived using this requirement). The competitive ratio is obtained by 
showing that the dual solution generated simultaneously, while infeasible, is not far from being feasible, 
i.e., scaling down the variables by a certain factor makes it feasible. This gives us a bound on the 
competitive ratio via weak duality. 
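
The flavor of this scheme can be sketched in Python for the special case of constraints of the form $\sum_{i \in S} x_i \geq 1$. The text above describes the update only qualitatively, so the specific rule used here, $x_i \leftarrow x_i (1 + 1/c_i) + 1/(|S| c_i)$, is an assumption taken from the standard presentation of this primal-dual method, and the function name, costs, and constraints are hypothetical.

```python
def online_fractional_cover(costs, constraints):
    """Process covering constraints sum_{i in S} x_i >= 1 one at a time.

    Assumed update (not spelled out in the text above):
    x_i <- x_i * (1 + 1/c_i) + 1/(|S| * c_i), repeated until the constraint holds.
    """
    x = [0.0] * len(costs)
    y = []                       # one dual variable per arriving constraint
    history = []                 # snapshot of x after each arrival
    for S in constraints:
        y_j = 0.0
        while sum(x[i] for i in S) < 1.0:
            for i in S:
                x[i] = x[i] * (1.0 + 1.0 / costs[i]) + 1.0 / (len(S) * costs[i])
            y_j += 1.0           # the dual variable grows additively
        y.append(y_j)
        history.append(list(x))
    return x, y, history

costs = [1.0, 2.0, 4.0]
constraints = [[0, 1], [1, 2], [0, 2]]
x, y, history = online_fractional_cover(costs, constraints)
print(x, y)
```

Since variables only ever increase, every constraint stays satisfied once it has been handled, which is the monotonicity requirement described above.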


4 Lower bound 


Can our analysis of the Multiplicative Weights algorithm be improved? This section shows that the 
answer is no, not only for the MW algorithm itself, but for any algorithm which operates in the online 
setting we have described. A similar lower bound was obtained by Klein and Young [48]. 

Technically, we prove the following: 


Theorem 4.1. For any online decision-making algorithm with $n \geq 2$ decisions, there exists a sequence 
of cost vectors $m^{(1)}, m^{(2)}, \ldots, m^{(T)} \in \{0,1\}^n$ such that $\min_i \sum_t m_i^{(t)} = \Omega(T)$, and if the sequence of 
distributions over decisions produced by the algorithm is $p^{(1)}, p^{(2)}, \ldots, p^{(T)}$, then we have 

\[ \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \geq \min_i \sum_{t=1}^{T} m_i^{(t)} + \Omega(\sqrt{T \ln n}). \]

Since in the theorem above we have $\min_i \sum_t m_i^{(t)} = \Omega(T)$, Theorem 2.1 implies that by choosing 
the optimal value of $\eta$, viz. $\eta = O(\sqrt{\log n / T})$, we have 

\[ \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \leq \min_i \sum_{t=1}^{T} m_i^{(t)} + O(\sqrt{T \ln n}). \]

Hence our analysis is tight up to constants in the additive error term. Moreover, the above lower bound 
applies to any algorithm, efficient or not. 


Proof. The proof is via the probabilistic method. We construct a distribution over the costs so that the 
required bound is obtained in expectation. Interestingly, the distribution is independent of the actual 
algorithm used. 

We now specify the costs of the decisions. The cost of decision 1 is set to 1/2 for all rounds t. For 
any decision $i > 1$, we construct its cost via the following random process: in each iteration $t$, choose its 
cost to be either one or zero uniformly at random, i.e., $m_i^{(t)} \in \{0,1\}$ with probability of each outcome 
being 1/2. 

The expected cost of each decision is 1/2. Hence, the expected cost of the chosen decision is also 
1/2 irrespective of the algorithm’s distribution $p^{(t)}$, and hence: 

\[ \mathbb{E}\left[ \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \right] = \frac{T}{2}. \qquad (4.1) \]

For every decision $i$, define $X_i = \sum_{t=1}^{T} m_i^{(t)}$. Note that $X_1 = T/2$, and for $i > 1$, we have $X_i \sim B(T, 1/2)$, 
where $B(T, 1/2)$ is the binomial distribution with T trials and both outcomes equally likely. 
We use the following standard concentration lemma (see Proposition 7.3.2 in [54]): 


Lemma 4.2. Let $X \sim B(T, 1/2)$ be a binomial random variable with T trials and probability 1/2. Then 
for $t \in [0, T/8]$: 

\[ \Pr\left[ X \leq \frac{T}{2} - t \right] \geq \frac{1}{15} e^{-16t^2/T}. \]


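We claim that the expectation of the cost of the best decision, viz., $\mathbb{E}[\min_i X_i]$, in our construction is 
$T/2 - \Omega(\sqrt{T \log n})$. Let $t = (1/4)\sqrt{T \ln(n-1)}$. We have 

\[ \Pr\left[ \min_i X_i > \frac{T}{2} - t \right] = \prod_{i=2}^{n} \Pr\left[ X_i > \frac{T}{2} - t \right] \leq \left( 1 - \frac{1}{15} e^{-16t^2/T} \right)^{n-1} \leq e^{-1/15} \leq 0.95. \]

The first equality above is by the independence of the $X_i$, and the first inequality is by Lemma 4.2; the 
second inequality uses $e^{-16t^2/T} = 1/(n-1)$ for our choice of $t$. Thus, 
with probability at least $1 - 0.95 = 0.05$, $\min_i X_i \leq (T/2) - t$. Since $\min_i X_i$ is always at most $T/2$ (because $X_1 = T/2$), we get 

\[ \mathbb{E}[\min_i X_i] \leq 0.95 \cdot \frac{T}{2} + 0.05 \left( \frac{T}{2} - t \right) = \frac{T}{2} - \frac{1}{80} \sqrt{T \ln(n-1)}. \]

It follows that 

\[ \mathbb{E}\left[ \min_i \sum_{t=1}^{T} m_i^{(t)} \right] \leq \frac{T}{2} - \frac{1}{80} \sqrt{T \ln(n-1)}. \qquad (4.2) \]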


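From (4.1) and (4.2), we have 

\[ \mathbb{E}\left[ \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} - \min_i \sum_{t=1}^{T} m_i^{(t)} \right] \geq \frac{1}{80} \sqrt{T \ln(n-1)}. \]

Since $\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} - \min_i \sum_{t=1}^{T} m_i^{(t)} \leq T$, by Markov’s inequality we conclude that 

\[ \Pr\left[ \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} - \min_i \sum_{t=1}^{T} m_i^{(t)} \geq \frac{1}{160} \sqrt{T \ln(n-1)} \right] \geq \frac{1}{160} \sqrt{\frac{\ln(n-1)}{T}}. \]

By the Hoeffding bound [2], for any decision $i > 1$, we have 

\[ \Pr\left[ \sum_{t=1}^{T} m_i^{(t)} < \frac{T}{4} \right] \leq \exp(-T/32). \]

By the union bound, we have 

\[ \Pr\left[ \min_i \sum_{t=1}^{T} m_i^{(t)} < \frac{T}{4} \right] \leq n \exp(-T/32) < \frac{1}{160} \sqrt{\frac{\ln(n-1)}{T}}, \]

for large enough T. This implies that there exists a sequence $m^{(1)}, m^{(2)}, \ldots, m^{(T)}$ for which 

\[ \min_i \sum_{t=1}^{T} m_i^{(t)} \geq \frac{T}{4} \]

and 

\[ \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} - \min_i \sum_{t=1}^{T} m_i^{(t)} \geq \frac{1}{160} \sqrt{T \ln(n-1)}. \]
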
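
The random construction in the proof is easy to simulate. The Python sketch below estimates $\mathbb{E}[\min_i X_i]$ by sampling and compares it with the right-hand side of (4.2); since every algorithm has expected cost exactly $T/2$ on these costs, the gap $T/2 - \mathbb{E}[\min_i X_i]$ is its expected regret. The values of `n`, `T`, the number of trials, the seed, and the function name are assumptions chosen for illustration.

```python
import math
import random

def min_total_cost(n, T, rng):
    """Total cost of the best decision under the random costs used in the proof."""
    best = T / 2                                  # decision 1 costs 1/2 in every round
    for _ in range(n - 1):
        # X_i ~ B(T, 1/2): count the ones among T fair random bits.
        x_i = bin(rng.getrandbits(T)).count("1")
        best = min(best, x_i)
    return best

rng = random.Random(0)
n, T, trials = 16, 400, 2000
avg_best = sum(min_total_cost(n, T, rng) for _ in range(trials)) / trials
claimed = T / 2 - math.sqrt(T * math.log(n - 1)) / 80    # right-hand side of (4.2)
gap = T / 2 - avg_best                                    # estimated expected regret
print(avg_best, claimed, gap)
```

In such a simulation the empirical gap is typically much larger than the constant $1/80$ suggests, which is consistent with the bound being loose only in its constant factor.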
5 The Matrix Multiplicative Weights algorithm 


In the preceding sections, we considered an online decision making problem. We refer to that setting as 
the basic or scalar case. Now we briefly consider a different online decision making problem, which is 
seemingly quite different from the previous one, but has enough structure that we can obtain an analogous 
algorithm for it. We move from cost vectors to cost matrices, and from probability vectors to density 
matrices. For this reason, we refer to the current setting as the matrix case. We call the algorithm presented 
in this setting the Matrix Multiplicative Weights algorithm. The original motivation for our interest in this 
matrix setting is that it leads to a constructive version of SDP duality, just as the standard multiplicative 
weights algorithm can be viewed as a constructive version of LP duality. In fact the standard algorithm is 
a special subcase of the algorithm in this section, namely, when all the matrices involved are diagonal. 

Applications of the matrix multiplicative weights algorithm include solving SDPs [5], derandomizing 
constructions of expander graphs, and obtaining bounds on the sample complexity for a learning problem 
in quantum computing. The celebrated result of Jain et al. [41] showing that QIP = PSPACE relied on 
the Matrix MW algorithm. Here QIP is the set of all languages which have quantum interactive proofs. 
These applications are unfortunately beyond the scope of this survey; please see the third author’s Ph.D. 
thesis [44] and Jain et al. [41] for details. The algorithm given here is from a paper of Arora and Kale [5]. 
A very similar algorithm was discovered independently slightly earlier by Warmuth and Kuzmin [63], 
and is based on the even earlier work of Tsuda, Rätsch, and Warmuth [62]. 

We stick with our basic decision-making scenario but decisions now correspond to unit vectors $v$ in 
$S^{n-1}$, the unit sphere in $\mathbb{R}^n$. As in the basic case, in every round, our task is to pick a decision $v \in S^{n-1}$. 
At this point, the costs of all decisions are revealed by nature. These costs are not arbitrary, but they are 
correlated in the following way. A cost matrix $M \in \mathbb{R}^{n \times n}$ is revealed, and the cost of a decision $v$ is then 
$v^\top M v$. We assume that the costs of all decisions lie in $[-1, 1]$. Again, as in the basic case, this is the 
only assumption we make on the way nature chooses the costs; indeed, the costs could even be chosen 
adversarially. Equivalently, we assume that all the eigenvalues of the matrix $M$ are in the range $[-1, 1]$. 

This game is repeated over a number of rounds. Let $t = 1, 2, \ldots, T$ denote the current round. In each 
round $t$, we select a distribution $D^{(t)}$ over the set of decisions $S^{n-1}$ and select a decision $v$ randomly 
from it (and use his advised course of action). At this point, the costs of all decisions are revealed by 
nature via the cost matrix $M^{(t)}$. The expected cost to the algorithm for choosing the distribution $D^{(t)}$ is 

\[ \mathbb{E}_{v \in D^{(t)}}[v^\top M^{(t)} v] = \mathbb{E}_{v \in D^{(t)}}[M^{(t)} \bullet vv^\top] = M^{(t)} \bullet \mathbb{E}_{v \in D^{(t)}}[vv^\top]. \]

Recall that we are using the notation $A \bullet B = \sum_{ij} A_{ij} B_{ij}$ to denote the scalar product of two matrices 
thinking of them as vectors in $\mathbb{R}^{n^2}$. Define the matrix $P^{(t)} := \mathbb{E}_{v \in D^{(t)}}[vv^\top]$. Note that $P^{(t)}$ is positive 
semidefinite: this is because it is a convex combination of the elementary positive semidefinite matrices 
$vv^\top$. Let $\mathrm{Tr}(A)$ denote the trace of a matrix A. Then, $\mathrm{Tr}(P^{(t)}) = 1$, again because for all $v$, we have 
$\mathrm{Tr}(vv^\top) = \|v\|^2 = 1$. A matrix P which is positive semidefinite and has trace 1 is called a density matrix. 

We will only be interested in the expected cost to the algorithm, and all the information required for 
computing the expected cost for a given distribution $D$ over $S^{n-1}$ is contained in the associated density 
matrix. 

Thus, in each round $t$, we require our online algorithm to choose a density matrix $P^{(t)}$, rather than a 
distribution $D^{(t)}$ over $S^{n-1}$. The distribution is implicit in the choice of $P^{(t)}$. The eigendecomposition 
$P^{(t)} = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$, where $\lambda_i$ is an eigenvalue corresponding to the unit eigenvector $v_i$, gives one such 
distribution (a discrete one): vector $v_i$ gets probability $\lambda_i$. We then observe the cost matrix $M^{(t)}$ revealed by 
nature, and suffer the expected cost $M^{(t)} \bullet P^{(t)}$. After T rounds, the total expected cost is $\sum_{t=1}^{T} M^{(t)} \bullet P^{(t)}$, 
while the best fixed decision in hindsight corresponds to the unit vector $v$ which minimizes $\sum_{t=1}^{T} v^\top M^{(t)} v$. 
Since we minimize this quantity over all unit vectors $v$, the variational characterization of eigenvalues 
implies that this minimum cost is exactly the minimum eigenvalue of $\sum_{t=1}^{T} M^{(t)}$, denoted by $\lambda_n(\sum_{t=1}^{T} M^{(t)})$. 
Our goal is to design an online algorithm whose total expected cost over the T rounds is not much more 
than the cost of the best decision. 

Consider the following generalization of the basic MW algorithm, called the Matrix Multiplicative 
Weights algorithm (Matrix MW). It uses the notion of matrix exponential, $\exp(A) := \sum_{i=0}^{\infty} A^i / i!$. The key 
point here is that regardless of the symmetric matrix A, its exponential $\exp(A)$ is always positive definite. Note that 
in case all matrices involved are diagonal, the Matrix Multiplicative Weights algorithm exactly reduces to 
the Hedge algorithm. 


Matrix Multiplicative Weights algorithm 
Initialization: Fix an $\eta \leq 1/2$. Initialize the weight matrix $W^{(1)} = I_n$. 
For $t = 1, 2, \ldots, T$: 


1. Use the density matrix $P^{(t)} = W^{(t)} / \Phi^{(t)}$, where $\Phi^{(t)} = \mathrm{Tr}(W^{(t)})$. 


2. Observe the cost matrix $M^{(t)}$. 


3. Update the weight matrix as follows: 

\[ W^{(t+1)} = \exp\left( -\eta \sum_{\tau=1}^{t} M^{(\tau)} \right). \]


Figure 4: The Matrix Multiplicative Weights algorithm. 
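
The algorithm of Figure 4 can be sketched in a few lines of NumPy. In this sketch the matrix exponential is computed from an eigendecomposition, which is valid because the cumulative cost matrix is symmetric; the random symmetric cost matrices, the seed, the values of `n`, `T`, `eta`, and the function name `matrix_mw` are assumptions chosen for illustration. The final lines evaluate both sides of the guarantee (5.1) of Theorem 5.1 below for the best fixed decision.

```python
import numpy as np

def matrix_mw(cost_matrices, eta):
    """Run Matrix MW and return the density matrices P^(1), ..., P^(T)."""
    n = cost_matrices[0].shape[0]
    cumulative = np.zeros((n, n))
    densities = []
    for M in cost_matrices:
        # W = exp(-eta * cumulative) via eigendecomposition of a symmetric matrix.
        eigvals, eigvecs = np.linalg.eigh(-eta * cumulative)
        # Shifting by the largest eigenvalue rescales W but leaves P unchanged.
        W = (eigvecs * np.exp(eigvals - eigvals.max())) @ eigvecs.T
        densities.append(W / np.trace(W))
        cumulative = cumulative + M
    return densities

rng = np.random.default_rng(0)
n, T, eta = 4, 200, 0.1
costs = []
for _ in range(T):
    A = rng.standard_normal((n, n))
    S = (A + A.T) / 2
    costs.append(S / np.linalg.norm(S, 2))      # eigenvalues scaled into [-1, 1]

P = matrix_mw(costs, eta)
alg_cost = sum(float(np.sum(M * Pt)) for M, Pt in zip(costs, P))
best_cost = float(np.linalg.eigvalsh(sum(costs))[0])     # minimum eigenvalue of the sum
second_order = sum(float(np.sum((M @ M) * Pt)) for M, Pt in zip(costs, P))
bound = best_cost + eta * second_order + np.log(n) / eta
print(alg_cost, bound)
```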


The following theorem bounds the total expected cost of the Matrix Multiplicative Weights algorithm 
(given in Figure 4) in terms of the cost of the best fixed decision. This theorem is completely analogous 
to Theorem 2.3, and in fact, Theorem 2.3 can be obtained directly from this theorem in the case when all 
matrices involved are diagonal. We omit the proof of this theorem; again, it is along the lines of the proof 
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of Theorem 2.3, and uses ®“) as a potential function. The analysis is based on the matrix analogue of 
the real number inequality exp(—nx) < 1 — nx + n?x? for |nx| < 1: if X is a matrix such that ||7X|| < 1, 
then 

—exp(—nX) < I—-nX+7n’X’. 


Also, we need an additional inequality from statistical mechanics, the Golden-Thompson inequality [32, 
61], which states that for two symmetric matrices $A$ and $B$, we have


$\mathrm{Tr}(\exp(A+B)) \le \mathrm{Tr}(\exp(A)\exp(B))$.
See [44, 5] for further details. 
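
Both inequalities are easy to sanity-check numerically. The sketch below is only an illustration on random symmetric matrices, not a proof; the helper names, the dimension, the value of `eta`, and the seed are assumptions made for the example.

```python
import numpy as np


def sym_expm(a):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, q = np.linalg.eigh(a)
    return (q * np.exp(w)) @ q.T


def random_symmetric(rng, n):
    """Hypothetical helper: a random symmetric n x n matrix."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / 2


rng = np.random.default_rng(1)
n, eta = 5, 0.5

# Quadratic upper bound: exp(-eta X) <= I - eta X + eta^2 X^2 in the PSD order,
# checked for a matrix X with spectral norm 1, so that ||eta X|| <= 1.
x = random_symmetric(rng, n)
x /= np.linalg.norm(x, 2)
gap = (np.eye(n) - eta * x + eta**2 * (x @ x)) - sym_expm(-eta * x)
min_gap_eig = np.linalg.eigvalsh(gap).min()  # nonnegative up to rounding

# Golden-Thompson: Tr(exp(A + B)) <= Tr(exp(A) exp(B)) for symmetric A, B.
a, b = random_symmetric(rng, n), random_symmetric(rng, n)
lhs = np.trace(sym_expm(a + b))
rhs = np.trace(sym_expm(a) @ sym_expm(b))
print(min_gap_eig, lhs, rhs)
```

When `a` and `b` commute, the two traces coincide; the inequality is only interesting in the non-commuting case, which is exactly what separates the matrix setting from the diagonal one.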


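Theorem 5.1. In the given setup, the Matrix Multiplicative Weights algorithm guarantees that after $T$
rounds, for any decision $v$, we have

\[
\sum_{t=1}^{T} M^{(t)} \bullet P^{(t)} \;\le\; \sum_{t=1}^{T} v^\top M^{(t)} v \;+\; \eta \sum_{t=1}^{T} (M^{(t)})^2 \bullet P^{(t)} \;+\; \frac{\ln n}{\eta}. \tag{5.1}
\]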
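
Although the proof is omitted, the way the two inequalities combine can be outlined as follows. This is a hedged sketch, assuming $\|M^{(t)}\| \le 1$ and $\eta \le 1/2$ (so that $\|\eta M^{(t)}\| \le 1$), with $W^{(t)} = \exp\bigl(-\eta \sum_{\tau<t} M^{(\tau)}\bigr)$ and $\Phi^{(t)} = \mathrm{Tr}(W^{(t)})$; it is not a substitute for the full proofs in [44, 5]. The first inequality is Golden-Thompson, the second uses the matrix bound on $\exp(-\eta X)$ together with $W^{(t)} \succeq 0$, and the last uses $1 + x \le e^x$:

\[
\begin{aligned}
\Phi^{(t+1)} &= \mathrm{Tr}\Bigl(\exp\Bigl(-\eta \sum_{\tau=1}^{t} M^{(\tau)}\Bigr)\Bigr)
\;\le\; \mathrm{Tr}\bigl(W^{(t)} \exp(-\eta M^{(t)})\bigr) \\
&\le \mathrm{Tr}\bigl(W^{(t)} \bigl(I - \eta M^{(t)} + \eta^2 (M^{(t)})^2\bigr)\bigr)
= \Phi^{(t)} \bigl(1 - \eta\, M^{(t)} \bullet P^{(t)} + \eta^2 (M^{(t)})^2 \bullet P^{(t)}\bigr) \\
&\le \Phi^{(t)} \exp\bigl(-\eta\, M^{(t)} \bullet P^{(t)} + \eta^2 (M^{(t)})^2 \bullet P^{(t)}\bigr).
\end{aligned}
\]

Since $\Phi^{(1)} = \mathrm{Tr}(I_n) = n$, induction gives an upper bound on $\Phi^{(T+1)}$, while the trace of a positive definite matrix is at least its largest eigenvalue, which gives a lower bound:

\[
\exp\Bigl(-\eta \sum_{t=1}^{T} v^\top M^{(t)} v\Bigr)
\;\le\; \exp\Bigl(-\eta\, \lambda_n\Bigl(\sum_{t=1}^{T} M^{(t)}\Bigr)\Bigr)
\;\le\; \Phi^{(T+1)}
\;\le\; n \exp\Bigl(-\eta \sum_{t=1}^{T} M^{(t)} \bullet P^{(t)} + \eta^2 \sum_{t=1}^{T} (M^{(t)})^2 \bullet P^{(t)}\Bigr).
\]

Taking logarithms and dividing by $\eta$ yields a bound of the form (5.1).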


5.1 Applications in solving SDPs 


In this section, we mention the application of the Matrix MW algorithm to solving SDPs without going 
into any details. We refer the interested reader to [44, 5] for further details and other applications. 
Consider the following quite general form of feasibility problem using SDP: 


\[
\begin{aligned}
\forall j = 1, 2, \ldots, m:\quad A_j \bullet X &\ge 0, \\
\mathrm{Tr}(X) &= 1, \\
X &\succeq 0.
\end{aligned} \tag{5.2}
\]


Note that if all the matrices $A_j$ and $X$ were diagonal, then this exactly reduces to the LP (3.1).

Now suppose, as in Section 3.1, that there exists a large-margin solution, i.e., a density matrix $X^*$
such that for all $j$, we have $A_j \bullet X^* \ge \varepsilon$. Then just as in Section 3.1, we can use the Matrix MW algorithm
to find a good solution, i.e., a density matrix $X$ such that for all $j$ we have $A_j \bullet X \ge 0$. We now need to
define $\rho = \max_j \|A_j\|$.

We run the Matrix MW algorithm (in the gain form). In each round, we set $X$ to be the current density
matrix $P^{(t)}$, and check if it is a good solution. If not, and there is a constraint $j$ such that $A_j \bullet X < 0$, then
we set $M^{(t)} = (1/\rho) A_j$. Note that in case all the matrices involved in the SDP are diagonal, then this
algorithm reduces to the Winnow algorithm of Section 3.1.²

The analysis is exactly the same as in Section 3.1. This analysis (using the gain form of Theorem 5.1,
corresponding to Theorem 2.5) implies that in at most $\lceil 4\rho^2 \ln n / \varepsilon^2 \rceil$ iterations, we find a good solution.
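
The feasibility procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of [5]: the step size `eta = eps / (2 * rho)` is an assumption chosen to be consistent with the stated iteration bound, constraints are scanned naively for a violated one, and the names `sym_expm` and `find_feasible_density` as well as the toy constraint matrices are hypothetical.

```python
import numpy as np


def sym_expm(a):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, q = np.linalg.eigh(a)
    return (q * np.exp(w)) @ q.T


def find_feasible_density(constraints, eps):
    """Search for a density matrix X with A_j . X >= 0 for every constraint A_j.

    Assumes some density matrix X* satisfies A_j . X* >= eps for all j.
    Returns X, or None if none is found within the iteration budget.
    """
    n = constraints[0].shape[0]
    rho = max(np.linalg.norm(a, 2) for a in constraints)  # rho = max_j ||A_j||
    eta = eps / (2 * rho)  # assumed step size, at most 1/2 because eps <= rho
    max_iters = int(np.ceil(4 * rho**2 * np.log(n) / eps**2))
    cumulative = np.zeros((n, n))
    for _ in range(max_iters + 1):
        # Gain form: the exponent has a positive sign.
        w = sym_expm(eta * cumulative)
        x = w / np.trace(w)
        violated = next((a for a in constraints if np.trace(a @ x) < 0), None)
        if violated is None:
            return x  # every constraint is satisfied by the current density
        cumulative += violated / rho  # gain matrix M^(t) = (1/rho) A_j
    return None


# Toy instance: X* = e_1 e_1^T satisfies both constraints with margin 0.5.
a1 = np.diag([1.0, -1.0, -1.0])
a2 = np.array([[0.5, 0.3, 0.0], [0.3, -1.0, 0.0], [0.0, 0.0, -0.5]])
x_found = find_feasible_density([a1, a2], eps=0.5)
```

On the toy instance the initial density $I/3$ violates the first constraint, so at least one update is performed before a feasible density matrix is returned.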

This technique was used in [5] to obtain significantly faster algorithms than previously known 
for approximating several combinatorial optimization problems such as SPARSEST CUT, BALANCED 
SEPARATOR (both in directed and undirected graphs), MIN UNCUT, and MIN 2CNF DELETION. This 
also gives the first near-linear time algorithm for solving the MAX CUT SDP of [31] to any constant 
approximation factor. 


² A minor difference being that we get the version of Winnow based on the Hedge algorithm, rather than the MW algorithm.